AD-A221  871 


ETL-0564 


Parallel  Algorithms 

for  Computer  Vision 

Final  Report 


Tomaso Poggio

Massachusetts  Institute  of  Technology 
Artificial  Intelligence  Laboratory 
545  Technology  Square 
Cambridge,  Massachusetts  02139 


April  1990 


Approved  for  public  release;  distribution  is  unlimited. 


Prepared  for: 

Defense  Advanced  Research  Projects  Agency 

1400 Wilson Boulevard

Arlington,  Virginia  22209-2308 

U.S. Army Corps of Engineers
Engineer  Topographic  Laboratories 
Fort  Belvoir,  Virginia  22060-5546 



REPORT  DOCUMENTATION  PAGE 




1. AGENCY USE ONLY (Leave blank)

2. REPORT DATE
16 April 1990

3. REPORT TYPE AND DATES COVERED
Final Annual, 31 Aug 88 - 31 Jan 90

4. TITLE AND SUBTITLE
Parallel Algorithms for Computer Vision - Final Report

5. FUNDING NUMBERS
(C) DACA76-85-C-0010

6. AUTHOR(S)
Poggio, Tomaso

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES)
Massachusetts Institute of Technology
Artificial Intelligence Laboratory
545 Technology Square
Cambridge, MA 02139

8. PERFORMING ORGANIZATION REPORT NUMBER

9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES)
DARPA
1400 Wilson Boulevard
Arlington, VA 22209-2308

US Army Engineer Topo Labs
Research Institute
Fort Belvoir, VA 22060-5546

10. SPONSORING/MONITORING AGENCY REPORT NUMBER
ETL-0564

11. SUPPLEMENTARY NOTES
This subject was previously discussed in:
ETL-0456 Parallel Algorithms for Computer Vision, Jan 1987, AD-A183 755
ETL-0495 Parallel Algorithms for Computer Vision, Second Year Report, Mar 1988, AD-A203 947
ETL-0529 Parallel Algorithms for Computer Vision, Third Year Report, Jan 1989, AD-A212 489

12a. DISTRIBUTION/AVAILABILITY STATEMENT
Approved for public release; distribution unlimited.

12b. DISTRIBUTION CODE

13. ABSTRACT (Maximum 200 words)

The main effort in this project has been directed towards the development of an
integrated vision system - the Vision Machine - based on a parallel supercomputer.
The core of the Vision Machine is in fact a set of parallel algorithms for visual
recognition and navigation in an unstructured environment. The present version
of the Vision Machine has been demonstrated to process images in close to real
time by (1) computing first several low-level cues, such as edges, stereo disparity,
optical flow, color and texture, (2) integrating them to extract a cartoon-like
description of the scene in terms of the physical discontinuities of surfaces,
and (3) using this cartoon in a recognition stage, based on parallel model matching.
In addition to the development of the parallel algorithms, their implementation
and testing, we have also done substantial work in several areas that are very
closely related: (1) design and fabrication of VLSI circuits to
transfer to potentially cheap and fast hardware some of the software algorithms,
(2) initial development of techniques to synthesize by learning vision algorithms,
and (3) several projects involving autonomous navigation of small robots.


14. SUBJECT TERMS
Computer vision
Parallel algorithms
Connection Machine

15. NUMBER OF PAGES
62

16. PRICE CODE

17. SECURITY CLASSIFICATION OF REPORT
Unclassified

18. SECURITY CLASSIFICATION OF THIS PAGE
Unclassified

19. SECURITY CLASSIFICATION OF ABSTRACT
Unclassified

20. LIMITATION OF ABSTRACT




Preface 


This  report  was  prepared  under  Contract  DACA76-85-C-0010  for  the  U.S. 
Army  Engineer  Topographic  Laboratories,  Fort  Belvoir,  Virginia  22060-5546  by 
Massachusetts  Institute  of  Technology,  Cambridge,  Massachusetts.  The  Contracting 
Officer’s  Representative  was  George  Lukes. 


Contents 


1 Overview

2 The Vision Machine
2.1 Introduction: The Vision Machine Project
2.2 The Vision Machine System
2.3 Hardware
2.3.1 The Eye-Head System
2.3.2 Our Computational Engine: The Connection Machine
2.4 Early Vision Algorithms and their Parallel Implementation
2.4.1 Edge Detection
2.4.2 Stereo
2.4.3 Motion
2.4.4 Color
2.4.5 Texture
2.4.6 The Integration Stage and MRF
2.5 Illustrative Results
2.6 Recognition
2.6.1 Learning in a three-stage recognition scheme
2.7 Future Developments

3 VLSI
3.0.1 A VLSI Vision Machine?

4 Learning
4.1 Radial Basis Functions
4.2 An extension: Generalized Radial Basis Functions
4.3 RBF are equivalent to regularization

5 Other Work
5.1 Labeling the physical origin of edges: computing qualitative surface attributes
5.2 Saliency, grouping and segmentation
5.2.1 Saliency Measure
5.2.2 T Junctions: Their Detection and Use in Grouping
5.3 Fast Vision: The Role of Time Smoothness
5.4 Parameter Estimation in the MRF integration stage
5.5 Object Recognition
5.5.1 Recognition from Matched Dimensionalities


1  Overview 


The main effort in the project has been directed towards the development of an integrated vision
system - the Vision Machine - based on a parallel supercomputer. The core of the Vision Machine
is in fact a set of parallel algorithms for visual recognition and navigation in an unstructured
environment. The present version of the Vision Machine has been demonstrated to process images
in close to real time, by

1.  computing  first  several  low-level  cues,  such  as  edges,  stereo  disparity,  optical  flow,  color  and 
texture, 

2.  integrating  them  to  extract  a  cartoon-like  description  of  the  scene  in  terms  of  the  physical 
discontinuities  of  surfaces, 

3.  using  this  cartoon  in  a  recognition  stage,  based  on  parallel  model  matching. 

In  addition  to  the  development  of  the  parallel  algorithms,  their  implementation  and  testing, we  have 
also  done  substantial  work  in  several  areas  that  are  very  closely  related: 

•  design  and  fabrication  of  VLSI  circuits  -  analog  and  digital  -  to  transfer  to  potentially 
cheap  and  very  fast  hardware  some  of  the  software  algorithms  of  the  Vision  Machine, 

•  initial  development  of  techniques  to  synthesize  by  learning  vision  algorithms  or  improve 
them  with  the  use  of  pertinent  examples, 

•  several  projects  involving  autonomous  navigation  of  small  robots,  recognition  techniques 
and  computation  of  salient  contours. 

In the following we will provide background information on all of these items. Additional details
can be found in the references cited.


2 The Vision Machine


2.1  Introduction:  The  Vision  Machine  Project 


Computer  vision  has  developed  algorithms  for  several  early  vision  processes,  such  as  edge  detection, 
stereopsis,  motion,  texture,  and  color,  which  give  separate  cues  as  to  the  distance  from  the  viewer 
of  three-dimensional  surfaces,  their  shape,  and  their  material  properties.  Biological  vision  systems, 
however,  greatly  outperform  computer  vision  programs.  It  is  clear  that  one  of  the  keys  to  the 
reliability,  flexibility,  and  robustness  of  biological  vision  systems  in  unconstrained  environments 
is their ability to integrate many different visual cues. For this reason, we have developed and
continue to develop a Vision Machine system to explore the issue of integration of early vision
modules. The system also serves the purpose of developing parallel vision algorithms, since its
main computational engine is a parallel supercomputer, the Connection Machine.

The  idea  behind  the  Vision  Machine  is  that  the  main  goal  of  the  integration  stage  is  to  compute 
a  map  of  the  visible  discontinuities  in  the  scene,  somewhat  similar  to  a  cartoon  or  a  line-drawing. 
There are several reasons for this. First, experience with existing model-based recognition
algorithms suggests that the critical problem in this type of recognition is to obtain a reasonably good
map of the scene in terms of features such as edges and corners. The map does not need to be
perfect (human recognition works with noisy and occluded line drawings) and, of course, it cannot be.
But it should be significantly cleaner than the typical map provided by an edge detector. Second,
discontinuities  of  surface  properties  are  the  most  important  locations  in  a  scene.  Third,  we  have 
argued  that  discontinuities  are  ideal  for  integrating  information  from  different  visual  cues. 

It is also clear that there are several different approaches to the problem of how to integrate visual
cues.  Let  us  list  some  of  the  obvious  possibilities: 

1) There is no active integration of visual processes. Their individual outputs are “integrated”
at the stage at which they are used, for example by a navigation system. This is the approach
advocated by Brooks (1987). While it makes sense for automatic, insect-like, visuo-motor tasks
such as tracking a target or avoiding obstacles (e.g., the fly’s visuo-motor system (Poggio and
Reichardt, 1976)), it seems quite unlikely for visual perception in the wide sense.

2)  The  visual  modules  are  so  tightly  coupled  that  it  is  impossible  to  consider  visual  modules 
as  separate,  even  in  a  first  order  approximation.  This  view  is  unattractive  on  epistemological, 
engineering  and  psychophysical  grounds. 

3)  The  visual  modules  are  coupled  to  each  other  and  to  the  image  data  in  a  parallel  fashion  -  each 
process  represented  as  an  array  coupled  to  the  arrays  associated  with  the  other  processes.  This 
point of view is in the tradition of Marr’s 2½-D sketch, and especially of the “intrinsic images” of
Barrow and Tenenbaum (1978). Our present scheme is of this type, and exploits the machinery of
Markov Random Field (MRF) models.

4) Integration of different vision modalities is taking place in a task-dependent way at specific
locations  -  not  over  the  whole  image  -  and  when  it  is  needed  -  therefore  not  at  all  times.  This 
approach  is  suggested  by  psychophysical  data  on  visual  attention  and  by  the  idea  of  visual  routines 
(Ullman,  1984;  see  also  Hurlbert  and  Poggio,  1986;  Mahoney,  1986;  Buelthoff  and  Mallot,  1987). 

We  have  actively  explored,  in  the  framework  of  the  contract  Parallel  Vision  Algorithms,  the  third 
of  these  approaches.  We  believe  that  the  last  two  approaches  are  compatible  with  each  other.  In 
particular,  visual  routines  may  operate  on  maps  of  discontinuities  such  as  those  delivered  by  the 
present  Vision  Machine,  and  therefore  be  located  after  a  parallel,  automatic  integration  stage.  In 
real  life,  of  course,  it  may  be  more  a  matter  of  coexistence.  We  believe,  in  fact,  that  a  control 
structure  based  on  specific  knowledge  about  the  properties  of  the  various  modules,  the  specific 
scene and the specific task will be needed in a later version of the Vision Machine to oversee and
control the MRF integration stage itself and its parameters. It is possible that the integration stage
should be much more goal-directed than what our present methods (MRF based) allow. The main
goal of our work is to find out whether this is true.

The  Vision  Machine  project  had  a  number  of  goals.  It  provided  a  focus  for  developing  parallel 
vision  algorithms  and  for  studying  how  to  organize  a  real-time  vision  system  on  a  massively  parallel 
supercomputer. It attempts to alter the usual paradigm of computer vision research over the past
years: choose a specific problem, for example stereo, find an algorithm, and test it in isolation. The
Vision  Machine  has  allowed  us  to  develop  and  test  an  algorithm  in  the  context  of  the  other  modules 
and  the  requirements  of  the  overall  visual  task,  above  all  visual  recognition.  For  this  reason,  the 
project  was  more  than  an  experiment  in  integration  and  parallel  processing:  it  was  and  still  is  a 
laboratory  for  our  theories  and  algorithms. 

Finally,  the  ultimate  goal  of  the  Vision  Machine  project  is  no  less  than  the  ultimate  goal  of  vision 
research:  to  build  a  vision  system  that  achieves  human-level  performance. 


2.2  The  Vision  Machine  System 

The  overall  organization  of  the  system  is  shown  in  Figure  1.  The  image(s)  are  processed  in  parallel 
through  independent  algorithms  or  modules  corresponding  to  different  visual  cues.  Edges  are 
extracted  using  Canny’s  edge  detector.  The  stereo  module  computes  disparity  from  the  left  and 
right  images.  The  motion  module  estimates  an  approximation  of  the  optical  flow  from  pairs  of 
images in a time sequence. The texture module computes texture attributes (such as density
and orientation of textons; see Voorhees, 1987). The color algorithm provides an estimate of
the spectral albedo of the surfaces, independently of the effective illumination, that is, illumination
gradients and shading effects, as suggested by Hurlbert and Poggio (1985).


Figure 1: Overall organization of the Vision Machine.


The  measurements  provided  by  the  early  vision  modules  are  typically  noisy,  and  possibly  sparse 
(for  stereo  and  motion).  They  are  smoothed  and  made  dense  by  exploiting  known  constraints 
within  each  process  (for  instance,  that  disparity  is  smooth).  This  is  the  stage  of  approximation  and 
restoration  of  data,  performed  using  a  Markov  Random  Field  model.  Simultaneously,  discontinuities 
are  found  in  each  cue.  Prior  knowledge  of  the  behavior  of  discontinuities  is  exploited,  for  instance, 
the  fact  that  they  are  continuous  lines,  not  isolated  points.  Detection  of  discontinuities  is  aided  by 
the  information  provided  by  brightness  edges.  Thus  each  cue,  disparity,  optical  flow,  texture,  and 
color,  is  coupled  to  the  edges  in  brightness. 
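
As a rough illustration of the idea in this paragraph, and not of the MRF formulation actually used in the Vision Machine, the following Python sketch fills in a sparse one-dimensional disparity profile by repeated neighbour averaging, with averaging blocked across positions marked as brightness edges. The function name `fill_with_edges`, the 1-D setting, the noise-free treatment of measured sites, and the fixed iteration count are all assumptions made for illustration.

```python
def fill_with_edges(sparse, edges, iterations=500):
    """Interpolate a sparse 1-D signal by neighbour averaging.

    sparse: list holding a float where a measurement exists, None elsewhere.
    edges:  edges[i] is True if a brightness edge lies between sites i and i+1;
            smoothing is not allowed to cross such a boundary.
    """
    n = len(sparse)
    known = [v is not None for v in sparse]
    # Unknown sites start at 0.0; measured sites keep their data value.
    values = [v if v is not None else 0.0 for v in sparse]
    for _ in range(iterations):
        new = values[:]
        for i in range(n):
            if known[i]:
                continue  # measured sites are held fixed (illustrative assumption)
            neighbours = []
            if i > 0 and not edges[i - 1]:
                neighbours.append(values[i - 1])
            if i < n - 1 and not edges[i]:
                neighbours.append(values[i + 1])
            if neighbours:
                new[i] = sum(neighbours) / len(neighbours)
        values = new
    return values


# Sparse disparity samples on two surfaces, with a brightness edge between sites 4 and 5.
sparse = [1.0, None, None, 1.0, None, 5.0, None, None, 5.0]
edges = [False, False, False, False, True, False, False, False]
result = fill_with_edges(sparse, edges)
```

With the edge in place, the interpolated profile keeps a sharp step between the two surfaces; if the edge is removed, site 4 is pulled to the average of its two neighbours and the step is blurred. This is the sense in which brightness edges aid the detection of discontinuities in the other cues.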

The  full  scheme  involves  finding  the  various  types  of  physical  discontinuities  in  the  surfaces,  depth 
discontinuities  (extremal  edges  and  blades),  orientation  discontinuities,  specular  edges,  albedo  edges 
(or  marks),  and  shadow  edges,  and  coupling  them  with  each  other  and  back  to  the  discontinuities  in 
the  visual  cues  (as  illustrated  in  Figure  1  and  suggested  by  Geiger  and  Weinshall,  1988  and  Gamble, 
Geiger, Poggio and Weinshall, 1989). So far we have implemented only the coupling of brightness
edges to each of the cues provided by the early algorithms. As we will discuss later, the technique we
use to approximate, to simultaneously detect discontinuities, and to couple the different processes,
is based on MRF models. The output of the system is a set of labeled discontinuities of the
surfaces around the viewer. Thus the scheme - an instance of inverse optics - computes surface
properties, that is, attributes of the physical world rather than of the images. Notice that we
attempt  to  find  discontinuities  in  surface  properties  and  therefore  qualitative  surface  properties: 
the  inverse  optics  paradigm  does  not  imply  that  physical  properties  of  the  surfaces,  such  as  depth 
or  reflectance,  should  be  extracted  precisely,  everywhere.  These  discontinuities,  taken  together, 
represent  a  “cartoon”  of  the  original  scene  which  can  be  used  for  recognition  and  navigation  (along 
with, if needed, interpolated depth, motion, texture and color fields). We have not yet integrated
our ongoing work on grouping into the Vision Machine. We expect to use a saliency operation on the
output of the edge detection process, possibly before the use of intensity edges by the MRF stage.
The grouping based on T-junctions (Beymer, in preparation) should take place on the intensity
edges at the same level as the MRF stage. Initial work in recognition has been integrated in the
system. The Vision Machine has been demonstrated working from images to recognition through
the integration of visual cues.

The  plan  of  this  section  is  as  follows.  We  will  first  review  the  current  hardware  of  the  Vision 
Machine: the eye-head system and the Connection Machine. We will then describe in some detail
each of the early vision algorithms that are presently running and are part of the system. After
this,  the  integration  stage  will  be  discussed.  We  will  analyze  some  results,  and  illustrate  the  merits 
and  the  pitfalls  of  our  present  system.  The  last  chapter  will  discuss  a  real-time  visual  system,  and 
some  ideas  on  how  to  put  the  system  into  VLSI  circuits  of  analog  and  digital  type. 


2.3  Hardware 


2.3.1  The  Eye-Head  System 

Because  of  the  scope  of  the  Vision  Machine  project,  a  general  purpose  image  input  device  is  required. 
Such  a  device  is  the  eye-head  system.  Here  we  discuss  its  current  and  future  configurations. 

The  eye-head  system  consists  of  two  CCD  cameras,  which  act  as  eyes,  mounted  on  a  variable- 
attitude  platform,  which  acts  as  the  head.  Inspired  by  biology,  the  apparatus  is  configured  such 
that  the  head  moves  the  eyes  as  a  unit,  while  allowing  the  eyes  to  point  independently.  Each  eye  is 
equipped with a motorized zoom lens (F1.4, focal length from 12.5 to 75mm), allowing control of
the iris, focus, and focal length by the host computer (currently a Symbolics 3600 Lisp Machine).
Other  hardware  allows  for  repeatable  calibration  of  the  entire  apparatus. 

Because  of  the  size  and  weight  of  the  motorized  lenses,  it  would  be  impractical  to  achieve  eye 
movement  by  pointing  the  camera/lens  assemblies  directly.  Instead,  each  assembly  is  mounted 
rigidly  on  the  head,  with  eye  movement  achieved  indirectly.  In  front  of  each  lens  is  a  pair  of  front 
surface  mirrors,  each  of  which  can  be  pivoted  by  a  galvanometer,  providing  two  degrees  of  freedom 
in  aiming  the  cameras.  At  the  expense  of  a  more  complicated  imaging  geometry,  we  get  a  simple 
and  fast  pointing  system  for  the  eyes. 

The  head  is  attached  to  its  mount  via  a  spherical  joint,  allowing  head  rotation  about  two  orthogonal 
axes  (pan  and  tilt).  Each  axis  is  driven  by  a  stepper  motor  coupled  to  its  drive  shaft  through  a 
harmonic  drive.  The  latter  provides  a  large  gear  ratio  in  conjunction  with  very  little  mechanical 
backlash.  Under  control  of  the  stepper  motors,  the  head  can  be  panned  180  degrees  from  left 
to  right,  and  tilted  90  degrees  (from  vertical-down  to  horizontal).  Each  of  the  stepper  motors  is 
provided  with  an  optical  shaft  encoder  for  shaft  position  feedback  (a  closed-loop  control  scheme 
is  employed  for  the  stepper  motors).  The  shaft  encoders  also  provide  an  index  pulse  (one  per 
revolution) which is used for joint calibration in conjunction with mechanical limit switches. The
latter  also  protect  the  head  from  damage  due  to  excessive  travel. 

The  overall  control  system  for  the  eye-head  system  is  distributed  over  a  micro-processor  network 
(UNET)  developed  at  the  MIT  AI  Laboratory  for  the  control  of  vision/robotics  hardware.  The 
UNET is a “multi-drop” network supporting up to 32 micros, under the control of a single host. The
micros normally function as network slaves, with the host acting as the master. In this mode the
micros only “speak when spoken to,” responding to various network operations either by receiving
information (command or otherwise) or by transmitting information (such as status or results).
Associated with each micro on the UNET is a local 16-bit bus (UBUS), which is totally under the
control of the micro. Peripheral devices such as motor drivers, galvanometer drivers, and pulse
width modulators (PWMs), to name a few, can be interfaced at this level.

At present, three micro-processors are installed on the eye-head UNET: one each for the
galvanometers, motorized lenses, and stepper motors. The processors currently employed are based on the
Intel  8051.  Each  of  these  micros  has  an  assortment  of  UBUS  peripherals  under  its  control.  By 
making  these  peripherals  sufficiently  powerful,  each  micro's  control  task  can  remain  simple  and 
manageable.  Code  for  the  micros,  written  in  both  assembly  language  and  C,  is  facilitated  by  a 
Lisp-based  debugging  environment. 
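
The master/slave discipline described above can be pictured with a small Python sketch. It is only a toy model of the "speak when spoken to" rule and of the 32-micro limit; the class names, method names, addresses, and command strings are hypothetical and do not reflect the actual UNET protocol or the 8051 firmware.

```python
class Micro:
    """A network slave: it acts only when the master addresses it."""

    def __init__(self, name):
        self.name = name
        self.last_command = None

    def receive(self, command):
        # The master pushes information (a command or other data) to this micro.
        self.last_command = command

    def transmit(self):
        # The master asks this micro for information (status or results).
        return {"name": self.name, "last_command": self.last_command}


class Host:
    """The single network master."""

    MAX_MICROS = 32  # limit stated in the text

    def __init__(self):
        self.micros = {}

    def attach(self, address, micro):
        if len(self.micros) >= self.MAX_MICROS:
            raise RuntimeError("network full")
        self.micros[address] = micro

    def send(self, address, command):
        self.micros[address].receive(command)

    def poll(self, address):
        return self.micros[address].transmit()


# Three micros, as on the eye-head network; addresses 0..2 are arbitrary.
host = Host()
for address, name in enumerate(["galvanometers", "lenses", "steppers"]):
    host.attach(address, Micro(name))

host.send(2, "pan +10")   # hypothetical command string
status = host.poll(2)
```

Every exchange is initiated by the host, so a micro never has to arbitrate for the network, which is consistent with keeping each micro's control task simple.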

A  single  major  enhancement  remains  for  the  eye-head  system.  Currently,  a  Symbolics  Lisp  Machine 
acts as the host processor for the UNET. In the fall of '89, an intermediate real-time processor will
be  placed  between  the  Lisp  Machine  and  the  UNET,  acting  as  master  of  the  latter.  The  real-time 
processor  (referred  to  as  the  DSP,  being  based  on  a  Digital  Signal  Processor  chip)  will  relieve  the 
Lisp  Machine  of  all  the  UNET  protocol  tasks,  as  well  as  various  low-level,  real-time  control  tasks  for 
which  the  Lisp  Machine  is  ill-suited.  Among  the  tasks  envisioned  for  the  DSP  is  optimal  position 
estimation  of  moving  targets. 


2.3.2  Our  Computational  Engine:  The  Connection  Machine 

The  Connection  Machine  is  a  powerful  fine-grained  parallel  machine  which  has  proven  useful  for 
implementation  of  vision  algorithms.  In  implementing  these  algorithms,  several  different  models 
of using the Connection Machine have emerged, since the machine provides several different
communication modes. The Connection Machine implementation of algorithms can take advantage
of  the  underlying  architecture  of  the  machine  in  novel  ways.  We  describe  here  several  common, 
elementary  operations  which  recur  throughout  the  following  discussion  of  parallel  algorithms. 

The  Connection  Machine 


The CM-2 version of the Connection Machine (Hillis, 1985) is a parallel computing machine with
between 16K and 64K processors, operating under a single instruction stream broadcast to all
processors. It is a Single Instruction Multiple Data (SIMD) machine; all processors execute the
same control stream. Each processor is a simple 1-bit processor, currently with 64K bits of memory,
optionally with a floating point arithmetic accelerator, shared among 16 (or 32) processors. There
are two modes of communication among the processors: the NEWS network and the router. The
NEWS network (so-called because the connections are in the four cardinal directions) provides
rapid direct communication between neighboring processors in a rectangular grid of arbitrary
dimension. For example, 64K processors could be configured into a two-dimensional 256 x 256 grid,
or into a four-dimensional 64 x 64 x 4 x 4 grid. The second mode of communication is the router,
which allows messages to be sent from any processor to any other processor in the machine. The
processors in the Connection Machine can be envisioned as being the vertices of a 16-dimensional
hypercube (in fact, it is a 12-dimensional hypercube; at each vertex of the hypercube resides a chip
containing 16 processors). Each processor in the Connection Machine is identified by its hypercube
address in the range 0...65535, imposing a linear order on the processors. This address denotes
the destination of messages handled by the router. Messages pass along the edges of the hypercube
from source processors to destination processors. The Connection Machine also has facilities for
returning to the host machine the result of various operations on a field in all processors; it can
return the global maximum, minimum, sum, logical AND, and logical OR of the field.
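
The arithmetic behind these figures can be checked with a short Python sketch. It assumes, purely for illustration, that the low four bits of a processor address select the processor within a chip and the remaining twelve bits select the chip (hypercube vertex); the text does not specify the bit layout, and the function names are hypothetical.

```python
CUBE_DIMS = 12         # dimensions of the hypercube of chips
PROCS_PER_CHIP = 16    # processors on each chip
N_PROCS = (2 ** CUBE_DIMS) * PROCS_PER_CHIP


def split_address(addr):
    """Split a processor address into (chip vertex, processor on chip).

    Assumed layout: low 4 bits = processor within the chip.
    """
    return addr // PROCS_PER_CHIP, addr % PROCS_PER_CHIP


def chip_neighbours(chip):
    """Chips joined to `chip` by a hypercube edge: addresses differing in one bit."""
    return [chip ^ (1 << d) for d in range(CUBE_DIMS)]


def hops(chip_a, chip_b):
    """Minimum number of hypercube edges between two chips (Hamming distance)."""
    return bin(chip_a ^ chip_b).count("1")


def grid_size(shape):
    """Number of processors needed for a NEWS grid of the given shape."""
    total = 1
    for extent in shape:
        total *= extent
    return total
```

Under these assumptions, `N_PROCS` is 65536, both example grids (`(256, 256)` and `(64, 64, 4, 4)`) use exactly that many processors, each chip has 12 hypercube neighbours, and no two chips are more than 12 edges apart.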

The floating-point arithmetic accelerator, which may optionally be added to the Connection
Machine, provides a significant increase in the speed of both single and double precision computations.
One floating-point processor chip serves a pair of Connection Machine processor chips with 32 total
processors in a pipelined fashion, and can produce a speed-up of more than a factor of twenty.

To  allow  the  machine  to  manipulate  data  structures  with  more  than  64K  elements,  the  Connection 
Machine  supports  virtual  processors.  A  single  physical  processor  can  operate  as  a  set  of  multiple 
virtual processors by serializing operations in time, and by partitioning the memory of each processor.
This is otherwise invisible to the user. Connection Machine programs utilize Common Lisp
syntax,  in  a  language  called  *Lisp,  and  are  manipulated  in  the  same  fashion  as  Lisp  programs. 

Powerful  Primitive  Operations 

Many  vision  problems  must  be  solved  by  a  combination  of  communication  modes  on  the  Connection 
Machine. The design of these algorithms takes advantage of the underlying architecture of the
machine  in  novel  ways.  There  are  several  common,  elementary  operations  used  in  this  discussion 
of  parallel  algorithms:  routing  operations,  scanning,  and  distance  doubling. 

Routing 

Memory  in  the  Connection  Machine  is  associated  with  processors.  Local  memory  can  be  accessed 
rapidly.  Memory  of  processors  nearby  in  the  NEWS  network  can  be  accessed  by  passing  it  through 
the  processors  on  the  path  between  the  source  and  the  destination.  At  present,  NEWS  accesses 
in  the  machine  are  made  in  the  same  direction  for  all  processors.  The  router  on  the  Connection 
Machine  provides  parallel  reads  and  writes  among  processor  memory  at  arbitrary  distances  and 
with arbitrary patterns. It uses a packet-switched message routing scheme to direct messages along
the  hypercube  connections  to  their  destinations.  This  powerful  communication  mode  can  be  used  to 
reconfigure  completely,  in  one  parallel  write  operation  taking  one  router  cycle,  a  field  of  information 
in  the  machine.  The  Connection  Machine  supplies  instructions  so  that  many  processors  can  read 
from the same location or write to the same location, but since these memory references can cause
significant  delay,  we  will  usually  only  consider  exclusive  read,  exclusive  write  instructions.  We  will 
usually  not  allow  more  than  one  processor  to  access  the  memory  of  another  processor  at  one  time. 
The  Connection  Machine  can  combine  messages  at  a  destination  by  various  operations,  such  as 
logical  AND,  inclusive  OR,  summation,  and  maximum  or  minimum. 
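
The combining behavior of the router can be modeled serially. The following is a minimal sketch, not the Connection Machine instruction set: the function name `send_with_combine` and its arguments are hypothetical, and the loop stands in for what the router does in one parallel write.

```python
def send_with_combine(dest, values, combine, size):
    """Model one router write: processor i sends values[i] to address dest[i].

    Messages that arrive at the same destination are merged with `combine`
    (for example max, min, or operator.add). Addresses that receive nothing
    keep the placeholder None. A dest entry of None means "do not send".
    """
    memory = [None] * size
    for d, v in zip(dest, values):
        if d is None:
            continue
        memory[d] = v if memory[d] is None else combine(memory[d], v)
    return memory


# Processors 0 and 2 both write to address 2; the collision is resolved by max.
received = send_with_combine([2, 0, 2, 3], [5, 7, 1, 4], max, size=4)
```

With exclusive writes (no two processors sharing a destination) the `combine` argument is never invoked, which corresponds to the cheaper case preferred in the text.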

Scanning 

The  scan  operations  (Blelloch,  1987)  can  be  used  to  simplify  and  speed  up  many  algorithms. 
They  directly  take  advantage  of  the  hypercube  connections  underlying  the  router,  and  can  be 
used  to  distribute  values  among  the  processors  and  to  aggregate  values  using  associative  operators. 
Formally, the scan operation takes a binary associative operator $\oplus$, with identity 0, and an ordered
set $[a_0, a_1, \ldots, a_{n-1}]$, and returns the set $[a_0, (a_0 \oplus a_1), \ldots, (a_0 \oplus a_1 \oplus \cdots \oplus a_{n-1})]$. This operation
is  sometimes  referred  to  as  the  data  independent  prefix  operation.  Binary  associative  operators 
include  minimum,  maximum,  and  plus. 

The four scan operations plus-scan, max-scan, min-scan, and copy-scan are implemented in microcode,
and take about the same amount of time as a routing cycle. The copy-scan operation
takes  a  value  at  the  first  processor  and  distributes  it  to  the  other  processors.  These  scan  operations 
can  take  segment  bits  that  divide  the  processor  ordering  into  segments.  The  beginning  of  each 
segment  is  marked  by  a  processor  whose  segment  bit  is  set,  and  the  scan  operations  start  over 
again  at  the  beginning  of  each  segment. 
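
The semantics of the segmented scans can be illustrated with a short serial model. This is only a sketch of what the operations return; the names `segmented_scan` and `copy_scan` are illustrative, and the loop says nothing about how the microcode achieves its speed.

```python
from operator import add


def segmented_scan(values, op, segment_bits=None):
    """Inclusive scan of `values` under the associative operator `op`.

    If segment_bits[i] is true, processor i begins a new segment and the
    running result starts over there.
    """
    out = []
    for i, v in enumerate(values):
        starts = i == 0 or (segment_bits is not None and bool(segment_bits[i]))
        out.append(v if starts else op(out[-1], v))
    return out


def copy_scan(values, segment_bits=None):
    """Distribute the first value of each segment to the rest of the segment."""
    return segmented_scan(values, lambda first, _other: first, segment_bits)


bits = [1, 0, 0, 1, 0]
sums = segmented_scan([1, 2, 3, 4, 5], add, bits)    # plus-scan
tops = segmented_scan([1, 2, 3, 4, 5], max, bits)    # max-scan
copies = copy_scan([7, 0, 0, 9, 0], bits)
```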

The scan operations also work using the NEWS addressing scheme, where they are termed grid-scans.
These quickly compute the sum, find the maximum, copy, or number values along rows or columns
of the NEWS grid.

For  example,  grid-scans  can  be  used  to  find,  for  each  pixel,  the  sum  of  a  square  region  with  width 
2m  +  1  centered  at  the  pixel.  This  sum  is  computed  using  the  following  steps.  First,  a  plus-scan 
operation  accumulates  partial  sums  for  all  pixels  along  the  rows.  Each  pixel  then  gets  the  result 
of  the  scan  from  the  processor  m  in  front  of  it  and  m  behind  it;  the  difference  of  these  two  values 
represents  the  sum,  for  each  pixel,  of  its  neighborhood  along  the  row.  We  now  execute  the  same 
calculation  on  the  columns,  resulting  in  the  sum,  for  each  pixel,  of  the  elements  in  its  square.  The 
whole  process  only  requires  a  few  scans  and  routing  operations,  and  runs  in  time  independent  of 
the  size  of  m.  The  summation  operations  are  generally  useful  to  accumulate  local  support  in  many 
of  our  algorithms,  such  as  stereo  and  motion. 
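
The steps above can be sketched as follows, using an inclusive running sum in place of the plus-scan. This is a serial illustration under assumptions of our own: windows are clipped at the image border, and because the scan is inclusive the value subtracted is the one at offset m + 1 behind the pixel. The helper names are hypothetical.

```python
def plus_scan(row):
    """Inclusive running sum along one row."""
    out, total = [], 0
    for v in row:
        total += v
        out.append(total)
    return out


def window_sums_1d(row, m):
    """Sum of row[i-m .. i+m], clipped at the borders, from a single scan."""
    n = len(row)
    s = plus_scan(row)
    out = []
    for i in range(n):
        ahead = s[min(i + m, n - 1)]                    # scan value m in front
        behind = s[i - m - 1] if i - m - 1 >= 0 else 0  # scan value just before the window
        out.append(ahead - behind)
    return out


def box_sums(image, m):
    """Sum over the (2m+1) x (2m+1) square at each pixel: rows first, then columns."""
    rows = [window_sums_1d(r, m) for r in image]
    cols = [window_sums_1d(list(c), m) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

The number of scans and differences is the same for every m, which mirrors the statement that the running time does not depend on the window size.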

Distance  Doubling 

Another  important  primitive  operation  is  distance  doubling  (Wyllie,  1979;  Lim,  1986),  which  can 
be  used  to  compute  the  effect  of  any  binary,  associative  operation,  as  in  scan,  on  processors  linked 
in  a  list  or  a  ring.  For  example,  using  max,  distance  doubling  can  find  the  extremum  of  a  field 
contained in the processors. Using message-passing on the router, distance doubling can propagate
the extreme value to all processors in the ring of N processors in O(log N) steps. Each step
involves two send operations. Typically, the value to be maximized is chosen to be the hypercube
address. At termination, each processor in the ring knows the label of the maximum processor in
the ring, hereafter termed the principal processor. This labels all connected processors uniquely,
and nominates a processor as the representative for the entire set of connected processors. At the
same time, the distance from the principal can be computed in each processor. Each processor
initially, at step 0, has the address of the next processor in the ring, and a value which is to be
maximized. At the termination of the ith step, a processor knows the addresses of processors $2^i + 1$
away, and the maximum of all values within $2^i - 1$ processors away. In the example, the maximum
value has been propagated to all 8 processors in log 8 = 3 steps.
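
A serial model of distance doubling on a ring is sketched below. It is an illustration only: `distance_doubling_max` is a hypothetical name, each list comprehension stands for one parallel send, and both comprehensions read the arrays from the previous step, as a synchronous machine would.

```python
def distance_doubling_max(values, nxt):
    """Propagate the maximum around a ring by pointer doubling.

    values[i] is the value held by processor i; nxt[i] is the address of its
    successor on the ring. Returns the per-processor maxima and the number of
    doubling steps taken.
    """
    n = len(values)
    best = list(values)
    ptr = list(nxt)
    span = 1          # best[i] currently covers `span` consecutive ring members
    steps = 0
    while span < n:
        # First send: fetch the running maximum held by the processor pointed to.
        best = [max(best[i], best[ptr[i]]) for i in range(n)]
        # Second send: fetch that processor's pointer, doubling the reach.
        ptr = [ptr[ptr[i]] for i in range(n)]
        span *= 2
        steps += 1
    return best, steps


ring = [(i + 1) % 8 for i in range(8)]
maxima, steps = distance_doubling_max([3, 9, 1, 4, 0, 7, 2, 5], ring)
```

For a ring of 8 processors the loop runs three times, matching the log 8 = 3 steps quoted in the text.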


2.4  Early  Vision  Algorithms  and  their  Parallel  Implementation 


2.4.1  Edge  Detection 

Edge  detection  is  a  key  first  step  in  correctly  identifying  physical  changes.  The  apparently  simple 
problem  of  measuring  sharp  brightness  changes  in  the  image  has  proven  to  be  difficult.  It  is  now  clear 
that edge detection should be understood not simply as finding “edges” in the images, an ill-defined
concept  in  general,  but  as  measuring  appropriate  derivatives  of  the  brightness  data.  This  involves 
the  task-dependent  use  of  different  two-dimensional  derivatives.  In  many  cases,  it  is  appropriate 
to  mark  locations  corresponding  to  appropriate  critical  points  of  the  derivative  such  as  maxima 
or  zeroes.  In  some  cases,  later  algorithms  based  on  these  binary  features  (presence  or  absence  of 
edges)  may  be  equivalent,  or  very  similar,  to  algorithms  that  directly  use  the  continuous  value  of 
the  derivatives.  A  case  in  point  is  provided  by  our  stereo  and  motion  algorithms,  to  be  described 
later.  As  a  consequence,  one  should  not  always  make  a  sharp  distinction  between  edge-based  and 
intensity-based algorithms; the distinction is more blurred, and in some cases it is almost a matter
of  implementation. 

In  our  current  implementation  of  the  Vision  Machine,  we  are  using  two  different  kinds  of  edges. 
The  first  consists  of  zero-crossings  in  the  Laplacian  of  the  image  filtered  through  an  appropriate 
Gaussian.  The  second  consists  of  the  edges  found  by  Canny’s  edge  detector.  Zero-crossings  can 
be  used  by  our  stereo  and  motion  algorithms  (though  we  have  mainly  used  Canny’s  edges  at  fine 
resolution).  Canny’s  edges  (at  a  coarser  resolution)  are  input  to  the  MRF  integration  scheme. 

Zero-Crossings 

Because the derivative operation is ill-posed, we need to filter the resultant data through an appropriate
low-pass filter (Torre and Poggio, 1986). The filter of choice (but not the only possibility!)
is  a  Gaussian  at  a  suitable  spatial  scale.  An  interesting  and  simple  implementation  of  Gaussian 
convolution  relies  on  the  binomial  approximation  to  the  Gaussian  distribution.  This  algorithm 
requires  only  integer  addition,  shifting,  and  local  communication  on  the  2D  mesh,  so  it  can  be 
implemented  on  a  simple  2D  mesh  architecture  (such  as  the  NEWS  network  on  the  Connection 
Machine). 
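
One way to realize the binomial approximation is repeated smoothing with the kernel [1, 2, 1] / 4, which needs only neighbor values, additions, and shifts. The sketch below assumes integer pixel values and replicated borders; both choices, and the function names, are ours rather than the report's. Each pass adds 1/2 to the variance of the equivalent Gaussian, and the right shift truncates toward zero.

```python
def binomial_smooth_1d(row, passes):
    """Apply the [1, 2, 1] / 4 kernel `passes` times using add and shift only."""
    out = list(row)
    for _ in range(passes):
        west = [out[0]] + out[:-1]     # value held by the west neighbor
        east = out[1:] + [out[-1]]     # value held by the east neighbor
        out = [(w + (c << 1) + e) >> 2 for w, c, e in zip(west, out, east)]
    return out


def binomial_smooth(image, passes):
    """Separable 2D smoothing: along rows, then along columns."""
    rows = [binomial_smooth_1d(r, passes) for r in image]
    cols = [binomial_smooth_1d(list(c), passes) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

Two passes over an impulse reproduce the binomial coefficients 1, 4, 6, 4, 1 (scaled), which is the sense in which the iteration approaches a Gaussian.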

The  Laplacian  of  a  Gaussian  is  often  approximated  by  the  difference  of  Gaussians.  The  Laplacian 
of a Gaussian can also be computed by convolution with a Gaussian followed by convolution with a
discrete  Laplacian;  we  have  implemented  both  on  the  Connection  Machine.  To  detect  zero-crossings, 
the  computation  at  each  pixel  need  only  examine  the  sign  bits  of  neighboring  pixels. 
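
A minimal sketch of the sign-bit test follows. The report does not say which neighbors are examined; comparing each pixel with its east and south neighbors is an assumption made here, as is the name `zero_crossings`.

```python
def zero_crossings(filtered):
    """Mark pixels whose sign bit differs from the east or south neighbor.

    `filtered` is the Laplacian-of-Gaussian (or difference-of-Gaussians)
    output as a list of rows. Only the sign of each value is used.
    """
    h, w = len(filtered), len(filtered[0])
    negative = [[v < 0 for v in row] for row in filtered]   # one sign bit per pixel
    marks = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            east = x + 1 < w and negative[y][x] != negative[y][x + 1]
            south = y + 1 < h and negative[y][x] != negative[y + 1][x]
            marks[y][x] = east or south
    return marks
```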

Canny  Edge  Detection 

The  Canny  edge  detector  is  often  used  in  image  understanding.  It  is  based  on  directional  derivatives, 
so  it  has  improved  localization.  The  Canny  edge  detector  on  the  Connection  Machine  consists  of 
the  following  steps: 

•  Gaussian  smoothing, 

•  Directional  derivative, 

•  Non-maximum  suppression, 

•  Thresholding  with  hysteresis. 

Gaussian  filtering,  as  described  above,  is  a  local  operation.  Computing  directional  derivatives  is 
also  local,  using  a  finite  difference  approximation  referencing  only  local  neighbors  in  the  image  grid. 

Non-maximum  Suppression 

Non-maximum suppression selects as edge candidates those pixels for which the gradient magnitude
is  maximal  in  the  direction  of  the  gradient.  This  involves  interpolating  the  gradient  magnitude 
between  each  of  two  pairs  of  adjacent  pixels  among  the  eight  neighbors  of  a  pixel,  one  forward  in 
the  gradient  direction,  and  one  backward.  However,  it  may  not  be  crucial  to  use  interpolation,  in 
which  case  magnitudes  of  neighboring  values  can  be  directly  compared. 
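
The variant without interpolation can be sketched as below: the gradient direction is quantized to the nearest of the eight neighbors, and the magnitude is compared with the neighbor ahead and the neighbor behind. The quantization, the tie-breaking rule, and the skipping of border pixels are assumptions of this sketch, not details taken from the report.

```python
import math

# Offsets (dx, dy) for directions 0, 45, ..., 315 degrees, with y increasing downward.
OFFSETS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]


def non_max_suppress(gx, gy):
    """Keep pixels whose gradient magnitude is maximal along the gradient direction."""
    h, w = len(gx), len(gx[0])
    mag = [[math.hypot(gx[y][x], gy[y][x]) for x in range(w)] for y in range(h)]
    keep = [[False] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if mag[y][x] == 0:
                continue
            angle = math.atan2(gy[y][x], gx[y][x])
            dx, dy = OFFSETS[round(angle / (math.pi / 4)) % 8]
            ahead = mag[y + dy][x + dx]
            behind = mag[y - dy][x - dx]
            # Strict on one side so a two-pixel plateau yields a single response.
            keep[y][x] = mag[y][x] > ahead and mag[y][x] >= behind
    return keep
```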

Thresholding with Hysteresis

Thresholding with hysteresis eliminates weak edges due to noise, using the threshold, while connecting
extended curves over small gaps using hysteresis. Two thresholds are computed, low and
high,  based  on  an  estimate  of  the  noise  in  the  image  brightness.  The  non-maximum  suppression 
step  selects  those  pixels  where  the  gradient  magnitude  is  maximal  in  the  direction  of  the  gradient. 
In  the  thresholding  step,  all  selected  pixels  with  gradient  magnitude  below  low  are  eliminated.  All 
pixels with values above high are considered as edges. All pixels with values between low and high
are  edges  if  they  can  be  connected  to  a  pixel  above  high  through  a  chain  of  pixels  above  low.  All 
others  are  eliminated. 

This  is  a  spreading  activation  operation;  it  propagates  information  along  a  set  of  connected  edge 
pixels.  The  algorithm  iterates,  in  each  step  marking  as  edge  pixels  any  low  pixels  adjacent  to 
edge pixels. When no pixels change state, the iteration terminates, taking O(m) steps, a number
proportional to the length m of the longest chain of low pixels which eventually become edge pixels.
The running time of this operation can be reduced to O(log m), using distance doubling.
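
The simple O(m) iteration can be sketched as follows. This is the plain spreading-activation loop, not the distance-doubling version; 8-connectivity between pixels and the function name are assumptions of the sketch. The input is taken to be the gradient magnitude at pixels that survived non-maximum suppression, and zero elsewhere.

```python
def hysteresis(mag, low, high):
    """Return (edge map, number of propagation steps)."""
    h, w = len(mag), len(mag[0])
    edge = [[mag[y][x] > high for x in range(w)] for y in range(h)]       # seeds
    candidate = [[mag[y][x] >= low for x in range(w)] for y in range(h)]  # not eliminated
    steps = 0
    while True:
        new = [row[:] for row in edge]
        changed = False
        for y in range(h):
            for x in range(w):
                if candidate[y][x] and not edge[y][x]:
                    # Promote a low pixel if any of its 8 neighbors is already an edge.
                    if any(edge[j][i]
                           for j in range(max(0, y - 1), min(h, y + 2))
                           for i in range(max(0, x - 1), min(w, x + 2))):
                        new[y][x] = True
                        changed = True
        if not changed:
            return edge, steps
        edge = new
        steps += 1
```

Each pass reads only the previous edge map, so one pass corresponds to one parallel step on the grid; the step count equals the length of the longest chain that is promoted.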

Noise  Estimation 

Estimating  noise  in  the  image  can  be  done  by  analyzing  a  histogram  of  the  gradient  magnitudes. 
Most computational implementations of this step perform a global analysis of the gradient magnitude
distribution, which is essentially non-local; we have had success with a Connection Machine
implementation  using  local  histograms.  The  thresholds  used  in  Canny  edge  detection  depend  on  the 
final  task  for  which  the  edges  are  used.  A  conservative  strategy  can  use  an  arbitrary  low  threshold 
to eliminate the need for the costly processing required to accumulate a histogram. Where a more
precise  estimate  of  noise  is  needed,  it  may  be  possible  to  find  a  scheme  that  uses  a  coarse  estimate 
of  the  gradient  magnitude  distribution,  with  minimal  global  communication. 


2.4.2  Stereo 


The Drumheller-Poggio parallel stereo algorithm (Drumheller and Poggio, 1986) runs as part of
the Vision Machine. Disparity data produced by the algorithm comprise one of the inputs to the
MRF-based integration stage of the Vision Machine. We are exploring various extensions of the
algorithm,  as  well  as  the  possible  use  of  feedback  from  the  integration  stage.  In  this  section,  we 
will  review  the  algorithm  briefly,  then  proceed  to  a  discussion  of  current  research. 

The  stereo  algorithm  runs  on  the  Connection  Machine  system  with  good  results  on  natural  scenes 
in  times  that  are  typically  on  the  order  of  one  second.  The  stereo  algorithm  is  presently  being 
extended  in  the  context  of  the  Vision  Machine  project. 


The  Drumheller-Poggio  Stereo  Algorithm 


Stereo matching is an ill-posed problem (see Bertero et al., 1988) that cannot be solved without
taking  advantage  of  natural  constraints.  The  continuity  constraint  (see,  for  instance,  Marr  and 
Poggio,  1976)  asserts  that  the  world  consists  primarily  of  piecewise  smooth  surfaces.  If  the  scene 
contains  no  transparent  objects,  then  the  uniqueness  constraint  applies:  there  can  be  only  one 
match  along  the  left  or  right  lines  of  sight.  If  there  are  no  narrow  occluding  objects,  the  ordering 
constraint (Yuille and Poggio, 1984) holds: any two points must be imaged in the same relative
order in the left and right eyes.

The  specific  a  priori  assumption  on  which  the  algorithm  is  based  is  that  the  disparity,  that  is,  the 
depth  of  the  surface,  is  locally  constant  in  a  small  region  surrounding  a  pixel.  It  is  a  restrictive 
assumption  which,  however,  may  be  a  satisfactory  local  approximation  in  many  cases  (it  can  be 
extended  to  more  general  surface  assumptions  in  a  straightforward  way,  but  at  a  high  computational 
cost). Let $E_L(x, y)$ and $E_R(x, y)$ represent the left and the right image of a stereo pair, or some
transformation of it, such as filtered images or a map of the zero-crossings in the two images (more
generally, they can be maps containing a feature vector at each location (x, y) in the image).

We look for a discrete disparity $d(x, y)$ at each location (x, y) in the image that minimizes

$$\| E_L(x, y) - E_R(x + d(x, y), y) \|_{\mathrm{patch}_i},$$

where the norm is a summation over a local neighborhood centered at each location (x, y); $d(x, y)$ is
assumed constant in the neighborhood. The previous equation implies that we should look at each
(x, y) for $d(x, y)$ such that

$$\int_{\mathrm{patch}_i} \left( E_L(x, y)\, E_R(x + d(x, y), y) \right)^2 \, dx\, dy \qquad (1)$$

is maximized.


The  algorithm  that  we  have  implemented  on  the  Connection  Machine  is  actually  somewhat  more 
complicated,  since  it  involves  geometric  constraints  that  affect  the  way  the  maximum  operation  is 
performed  (see  Drumheller  and  Poggio,  1986).  The  implementation  currently  used  in  the  Vision 
Machine at the AI Laboratory uses the maps of Canny edges obtained from each image for $E_L$ and
$E_R$.

In  more  detail,  the  algorithm  is  composed  of  the  following  steps: 

1)  Compute  features  for  matching. 

2)  Compute  potential  matches  between  features. 

3)  Determine  the  degree  of  continuity  around  each  potential  match. 

4)  Choose  correct  matches  based  on  the  constraints  of  continuity,  uniqueness,  and  ordering. 

Potential matches between features are computed in the following way. Assuming that the images
are registered so that the epipolar lines are horizontal, the stereo matching problem becomes one-dimensional:
an edge in the left image can match any of the edges in the corresponding horizontal
scan line in the right image. Sliding the right image over the left image horizontally, we compute
a set of potential match planes, one for each horizontal disparity. Let p(x, y, d) denote the value of
the (x, y) entry of the potential match plane at disparity d. We set p(x, y, d) = 1 if there is an edge
at location (x, y) in the left image and a compatible edge at location (x - d, y) in the right image;
otherwise, set p(x, y, d) = 0. In the case of the DOG edge detector, two edges are compatible if the
sign of the convolution for each edge is the same.

To determine the degree of continuity around each potential match (x, y, d), we compute a local
support score $s(x, y, d) = \sum_{\mathrm{patch}} p(x, y, d)$, where patch is a small neighborhood of (x, y, d) within
the dth potential match plane. In effect, nearby points in patch can “vote” for the disparity d.
The score s(x, y, d) will be high if the continuity constraint is satisfied near (x, y, d), i.e., if patch
contains many votes. This step corresponds to the integral over the patch in the last equation.

Finally, we attempt to select the correct matches by applying the uniqueness and ordering constraints
(see above). To apply the uniqueness constraint, each match suppresses all other matches
along the left and right lines of sight with weaker scores. To enforce the ordering constraint, if two
matches are not imaged in the same relative order in left and right views, we discard the match
with the smaller support score. In effect, each match suppresses matches with lower scores in its
forbidden zone (Yuille and Poggio, 1984). This step corresponds to choosing the disparity value
that maximizes the integral of the last equation.
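
Steps 2 to 4 can be sketched in simplified form as follows. The sketch treats any two edges as compatible, computes the support score by direct summation rather than by scans, and applies uniqueness only along the left line of sight (one winning disparity per left-image edge); suppression along the right line of sight and the ordering constraint are omitted. All function names are illustrative.

```python
def match_planes(left, right, disparities):
    """p[d][y][x] = 1 if left has an edge at (x, y) and right has one at (x - d, y)."""
    h, w = len(left), len(left[0])
    return {d: [[1 if left[y][x] and 0 <= x - d < w and right[y][x - d] else 0
                 for x in range(w)] for y in range(h)]
            for d in disparities}


def support(plane, m):
    """s(x, y): number of votes in the (2m+1) x (2m+1) patch centered at (x, y)."""
    h, w = len(plane), len(plane[0])
    return [[sum(plane[j][i]
                 for j in range(max(0, y - m), min(h, y + m + 1))
                 for i in range(max(0, x - m), min(w, x + m + 1)))
             for x in range(w)] for y in range(h)]


def best_disparity(left, right, disparities, m):
    """For each left edge, pick the potential match with the highest support."""
    planes = match_planes(left, right, disparities)
    scores = {d: support(planes[d], m) for d in disparities}
    h, w = len(left), len(left[0])
    result = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            candidates = [d for d in disparities if planes[d][y][x]]
            if candidates:
                result[y][x] = max(candidates, key=lambda d: scores[d][y][x])
    return result
```

In this form the false matches that a single edge finds at other disparities lose to the correct one because their neighbors do not vote for them.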

Improvements 

Using  this  algorithm  as  a  base,  we  have  explored  several  of  the  following  topics: 

Detection  of  Depth  Discontinuities 

The  Marr-Poggio  continuity  constraint  is  both  a  strength  and  a  weakness  of  the  stereo  algorithm. 
Favoring  continuous  disparity  surfaces  reduces  the  solution  space  tremendously,  but  also  tends  to 
smooth  over  depth  discontinuities  present  in  the  scene.  Consider  what  happens  near  a  linear  depth 
discontinuity,  say  a  point  near  the  edge  of  a  table  viewed  from  above.  The  square  local  support 
neighborhood  for  the  point  will  be  divided  between  points  on  the  table  and  points  on  the  floor; 
thus,  almost  half  of  the  votes  will  be  for  the  wrong  disparity. 

One  solution  to  this  problem  is  feedback  from  the  MRF  integration  stage.  We  can  take  the  depth 
discontinuities  located  by  the  integration  stage  (using  the  results  from  a  first  pass  of  the  stereo 
algorithm,  among  other  inputs)  and  use  them  to  restrict  the  local  support  neighborhoods  so  that 
they  do  not  span  discontinuities.  In  the  example  mentioned  above,  the  support  neighborhood  would 
be  trimmed  to  avoid  crossing  the  discontinuity  between  the  table  and  the  floor,  and  thus  would  not 
pick  up  spurious  votes  from  the  floor. 

We  can  also  try  to  locate  discontinuities  by  examining  intermediate  results  of  the  stereo  algorithm. 
Consider  a  histogram  of  votes  vs.  disparity  for  the  table/floor  example.  For  a  support  region 
centered  near  the  edge  of  the  table,  we  expect  to  see  two  strong  peaks:  one  at  the  disparity  of  the 
floor,  and  the  other  at  the  disparity  of  the  table.  Therefore  a  bimodal  histogram  is  strong  evidence 
for  the  presence  of  a  discontinuity. 
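
A crude bimodality test on the vote histogram might look like the sketch below. The report does not specify a criterion; requiring two local maxima that each hold at least a fixed fraction of the votes is an assumption made here, and `min_fraction` is a hypothetical tuning parameter.

```python
def strong_peaks(votes, min_fraction=0.25):
    """Indices of local maxima that hold at least min_fraction of all votes.

    votes[k] is the support score at the k-th disparity for one support region.
    """
    total = sum(votes)
    peaks = []
    for i, v in enumerate(votes):
        left = votes[i - 1] if i > 0 else -1
        right = votes[i + 1] if i + 1 < len(votes) else -1
        if total > 0 and v > left and v >= right and v >= min_fraction * total:
            peaks.append(i)
    return peaks


def looks_bimodal(votes, min_fraction=0.25):
    """True if the histogram has at least two strong peaks."""
    return len(strong_peaks(votes, min_fraction)) >= 2
```
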

These two ideas can be used in conjunction. Discontinuity detection within stereo can take advantage
of the extra information provided by the vote histograms. By passing better depth data
(and perhaps candidate discontinuity locations) to the integration stage, we improve the detection
of discontinuities at the higher level.

Improving  the  Stereo  Matcher 

The  original  Drumheller-Poggio  algorithm  matched  DOG  zero-crossings,  where  the  local  support 
score counted the number of zero-crossings in the left image patch matching edges in the right
image  patch  at  a  given  disparity.  We  have  modified  the  matcher  in  a  variety  of  ways. 

1) Canny edges. The matcher now uses edges derived by a parallel implementation of the Canny
edge detector (Canny, 1983; Little et al., 1987) rather than DOG zero-crossings, for better localization.

2)  Gradient  direction  constraint.  We  allow  two  Canny  edges  to  match  only  if  the  associated 
brightness  gradient  directions  are  aligned  within  a  parameterized  tolerance.  This  is  analogous  to  the 
restriction in the Marr-Poggio-Grimson stereo algorithm (Grimson, 1981), where two zero-crossings
can  match  only  if  the  directions  of  the  DOG  gradients  are  approximately  equal.  Matching  gradient 
orientations  is  a  tighter  constraint  than  matching  the  sign  of  the  DOG  convolution.  Furthermore, 
the  DOG  sign  is  numerically  unstable  for  horizontally  oriented  edges. 

3)  The  scores  are  now  normalized  to  take  into  account  the  number  of  edges  in  the  left  and  right 
image  patches  eligible  to  match,  so  that  patches  with  high  edge  densities  do  not  generate  artificially 
high  scores.  We  plan  to  change  the  matcher  so  that  edges  that  fail  to  match  would  count  as  negative 
evidence  by  reducing  the  support  score,  but  this  has  not  yet  been  implemented. 

In the near future, we will explore matching brightness values as well as edges, using a cross-correlation
approach similar to that of Little, Buelthoff and Poggio (1987) for motion estimation.

Identifying  Areas  that  are  Outside  of  the  Matcher's  Disparity  Range 

The  stereo  algorithm  searches  a  limited  disparity  range,  selected  manually.  Every  potential  match 
in  the  scene  (an  edge  with  a  matching  edge  at  some  disparity)  is  assigned  the  in-range  disparity 
with the highest score, even though the correct disparity may be out of range. How can we tell
when  an  area  of  the  scene  is  out  of  range?  The  most  effective  approach  that  we  have  attempted 
to  date  is  to  look  for  regions  with  low  matching  scores.  Two  patches  that  are  incorrectly  matched 
will,  in  general,  produce  a  low  matching  score. 

Memory-Based  Registration  and  Calibration 

Registration  of  the  image  pair  for  the  stereo  algorithm  is  done  by  presenting  to  the  system  a 
pattern  of  dots,  roughly  on  a  sparse  grid,  at  the  distance  around  which  stereo  has  to  operate.  The 
registration  is  accomplished  using  a  warping  computed  by  matching  the  dots  from  the  left  and 
right  images.  The  dots  are  sparse  enough  that  matching  is  unambiguous.  The  matching  defines  a 
warping vector for each dot; at other points the warping is computed by bilinear interpolation of
the  two  components  of  warping  vectors.  The  warping  necessary  for  mapping  the  right  image  onto 
the  left  image  is  then  stored.  Prior  to  stereo-matching,  the  right  image  is  warped  according  to  the 
pre-stored  addresses  by  sending  each  pixel  in  the  right  image  to  the  processor  specified  in  the  table. 
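
The interpolation of the warping vectors can be sketched as below. For simplicity the sketch assumes the calibration dots lie on a regular grid with spacing `step` pixels, whereas the report only says the dots are roughly on a sparse grid; the function name and data layout are hypothetical.

```python
def bilinear_warp(grid, step, x, y):
    """Interpolate the warping vector at pixel (x, y).

    grid[j][i] is the (dx, dy) vector measured at the dot located at
    pixel (i * step, j * step). Both components are interpolated bilinearly
    within the grid cell that contains (x, y).
    """
    i = min(int(x // step), len(grid[0]) - 2)
    j = min(int(y // step), len(grid) - 2)
    fx = x / step - i
    fy = y / step - j

    def lerp(a, b, t):
        return a + (b - a) * t

    components = []
    for c in (0, 1):
        top = lerp(grid[j][i][c], grid[j][i + 1][c], fx)
        bottom = lerp(grid[j + 1][i][c], grid[j + 1][i + 1][c], fx)
        components.append(lerp(top, bottom, fy))
    return tuple(components)
```

Evaluating this once per pixel gives the table of destination addresses that is stored and later applied with a single router send.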

The warping table corrects for deformations, including those due to vertical disparities and rotations,
and those due to the image geometry (errors in the alignment of the cameras, perspective
projection, errors introduced by the optics, etc.). We plan to store several warping tables for each
of a few convergence angles of the two cameras (assuming symmetric convergence). We conjecture
that simple interpolation can yield sufficiently accurate warping tables for fixation angles intermediate
to the ones stored. Notice that these tables are independent of the position of the head.
Absolute  depth  is  not  the  concern  here  (we  are  not  using  it  in  our  present  Vision  Machine),  but 
it  could  easily  be  recovered  from  knowledge  of  the  convergence  angle.  Notice  also  that  the  whole 
registration  scheme  has  the  flavor  of  a  learning  process.  Convergence  angles  are  inputs  and  warping 
tables  are  the  outputs  of  the  modules;  the  set  of  angles,  together  with  the  associated  warping  tables, 
represent  the  set  of  input-output  examples.  The  system  can  “generalize”  by  interpolating  between 
warping tables and providing the warping corresponding to a vergence angle that does not appear
in  the  set  of  “examples”.  Calibration  of  disparity  to  depth  could  be  done  in  a  similar  way. 


2.4.3  Motion 

The  motion  algorithm  computes  the  optical  flow  field,  a  vector  field  that  approximates  the  projected 
motion  field.  The  procedure  produces  sparse  or  dense  output,  depending  on  whether  it  uses  edge 
features or intensities. The algorithm assumes that image displacements are small, within a range
$(\pm\delta, \pm\delta)$. It is also assumed that the optical flow is locally constant in a small region surrounding
a point. This assumption is strictly only true for translational motion of 3D planar surface patches
parallel to the image plane. It is a restrictive assumption which, however, may be a satisfactory
local approximation in many cases. Let $E_t(x, y)$ and $E_{t+\Delta t}(x, y)$ represent transformations of two
discrete images separated by time interval $\Delta t$, such as filtered images, or a map of the brightness
changes in the two images (more generally, they can be maps containing a feature vector at each
location (x, y) in the image) (Kass, 1986; Nishihara, 1984).

We look for a discrete motion displacement $v = (v_x, v_y)$ at each location (x, y) in the image that
minimizes

$$\| E_t(x, y) - E_{t+\Delta t}(x + v_x \Delta t, y + v_y \Delta t) \|_{\mathrm{patch}_i} = \min$$

where the norm is a summation over a local neighborhood centered at each location (x, y); $v(x, y)$
is assumed constant in the neighborhood. The previous equation implies that we should look at
each (x, y) for $v = (v_x, v_y)$ such that

$$\int_{\mathrm{patch}_i} \left( E_t(x, y) - E_{t+\Delta t}(x + v_x \Delta t, y + v_y \Delta t) \right)^2 \, dx\, dy \qquad (2)$$

is minimized. Alternatively, one can maximize the negative of the integrated result. The last
equation represents the sum of the pointwise squared differences between a patch in the first image
centered around the location (x, y) and a patch in the second image centered around the location
$(x + v_x \Delta t, y + v_y \Delta t)$.


This algorithm can be translated easily into the following description. Consider a network of
processors representing the result of the integrand in the previous expression. Assume for simplicity
that this result is either 0 or 1 (this is the case if $E_t$ and $E_{t+\Delta t}$ are binary feature maps). The
processors hold the result of differencing (taking the logical “exclusive or”) the two image
maps for different values of (x, y) and $v_x$, $v_y$. The next stage, corresponding exactly to the integral
operation over the patch, is for each processor to sum the total in an (x, y) neighborhood at the
same displacement. Note that this summation operation is efficiently implemented on the Connection
Machine using scan computations. Each processor thus collects a vote indicating support that a
patch of surface exists at that displacement. The algorithm iterates over all displacements in the
range $(\pm\delta, \pm\delta)$, recording the values of the integral for each displacement. The last stage is to choose
$v(x, y)$ among the displacements in the allowed range that maximizes the integral. This is done by
an operation of “non-maximum suppression” across velocities out of the finite allowed set: at the
given (x, y), the processor is found that has the maximum vote. The corresponding $v(x, y)$ is the
velocity of the surface patch found by the algorithm. The actual implementation of this scheme can
be simplified so that the “non-maximum suppression” occurs during iteration over displacements,
so that no actual table of summed differences over displacements need be constructed.

In practice, the algorithm has been shown to be effective both for synthetic and natural images using different
types of features or measurements on the brightness data, including edges (both zero-crossings of
the Laplacian of Gaussian and Canny’s method), which generate sparse results along brightness
edges, or brightness data directly, or the Laplacian of Gaussian, or its sign, which generate dense
results. Because the optical flow is computed from quantities integrated over the individual patches,
the results are robust against the effects of uncorrelated noise.
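
The simplified scheme, in which the best displacement is tracked during the iteration so that no table over displacements is built, can be sketched as follows. The sketch minimizes the summed squared difference of equation (2), which for binary feature maps is the count of “exclusive or” mismatches. The patch sum is computed directly rather than with scans, and pixels displaced outside the second image are compared against zero; both are assumptions of the sketch, as is the function name.

```python
def optical_flow(e0, e1, delta, m):
    """Displacement (vx, vy) in [-delta, delta]^2 minimizing the patchwise SSD."""
    h, w = len(e0), len(e0[0])
    best_cost = [[None] * w for _ in range(h)]
    flow = [[(0, 0)] * w for _ in range(h)]
    for vy in range(-delta, delta + 1):
        for vx in range(-delta, delta + 1):
            # Pointwise squared difference at this displacement.
            diff = [[(e0[y][x] - e1[y + vy][x + vx]) ** 2
                     if 0 <= y + vy < h and 0 <= x + vx < w
                     else e0[y][x] ** 2
                     for x in range(w)] for y in range(h)]
            for y in range(h):
                for x in range(w):
                    # Sum over the (2m+1) x (2m+1) patch, clipped at the border.
                    cost = sum(diff[j][i]
                               for j in range(max(0, y - m), min(h, y + m + 1))
                               for i in range(max(0, x - m), min(w, x + m + 1)))
                    # Keep only the running best: no table over displacements.
                    if best_cost[y][x] is None or cost < best_cost[y][x]:
                        best_cost[y][x] = cost
                        flow[y][x] = (vx, vy)
    return flow
```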


The comparison stage employs patchwise cross-correlation, which exploits local constancy of the
optical flow (the velocity field is guaranteed to be constant for translations parallel to the image
plane of a planar surface patch); it is a cubic polynomial for arbitrary motion of a planar surface
(see  Waxman,  1987;  Little  et  al.,  1987).  Experimentally,  we  have  used  zero-crossings,  the  Laplacian 
of  Gaussian  filtered  image,  its  sign,  and  the  smoothed  brightness  values,  with  similar  results.  It 
is  interesting  that  methods  superficially  so  different  (edge-based  and  intensity-based)  give  such 
similar  results.  As  we  mentioned  earlier,  this  is  not  surprising.  There  are  theoretical  arguments 
that support, for instance, the equivalence of cross-correlating the sign bit of the Laplacian filtered
image and the Laplacian filtered image itself. The argument is based on the following theorem
(see Little, Buelthoff, and Poggio, in preparation), which is a slight reformulation of a well-known
theorem. 

Theorem 

If f(x,y) and g(x,y) are zero mean jointly normal processes, their cross-correlation is determined
fully by the correlation of the sign of f and of the sign of g (and determines it). In particular

$$R_{\hat{f}\hat{g}} = \frac{2}{\pi} \arcsin(R_{fg})$$

where $\hat{f} = \mathrm{sign}\, f$ and $\hat{g} = \mathrm{sign}\, g$.

Thus, cross-correlation of the sign bit is exactly equivalent to cross-correlation of the signal itself
(for  Gaussian  processes).  Notice  that  from  the  point  of  view  of  information,  the  sign  bit  of  the 
signal  is  completely  equivalent  to  the  zero-crossing  of  the  signal.  Nishihara  first  used  patchwise 
cross-correlation  of  the  sign  bit  of  DOG  filtered  images  (Nishihara,  1984),  and  has  implemented  it 
more  recently  on  real-time  hardware  (Nishihara  and  Crossley,  1988). 
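
The relation between the correlation of the signs and the correlation of the signals can be checked numerically. The Python sketch below is only an illustration: it assumes unit-variance Gaussian variables, so that the correlation of f and g is the normalized coefficient `rho`, and the function names, sample count, and seed are hypothetical.

```python
import math
import random


def sign_correlation(rho, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[sign(f) * sign(g)] for correlated Gaussians."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho * rho)
    total = 0
    for _ in range(n_samples):
        f = rng.gauss(0.0, 1.0)
        # g has unit variance and correlation rho with f.
        g = rho * f + scale * rng.gauss(0.0, 1.0)
        total += (1 if f >= 0 else -1) * (1 if g >= 0 else -1)
    return total / n_samples


def arcsine_prediction(rho):
    """Sign correlation predicted by the theorem, (2 / pi) * arcsin(rho)."""
    return (2.0 / math.pi) * math.asin(rho)
```

For `rho = 0.5` the prediction is 1/3, and the sampled estimate should agree with it to within sampling error.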

The existence of discontinuities can be detected in optical flow, as in stereo, both during computation
and by processing the resulting flow field. The latter field is input to the MRF integration
stage.  During  computation,  discontinuities  in  optical  flow  arising  from  occlusions  are  indicated  by 
low  normalized  scores  for  the  chosen  displacement. 


2.4.4  Color 

The  color  algorithm  that  we  have  implemented  is  a  very  preliminary  version  of  a  module  that 
should  find  the  boundaries  in  the  surface  spectral  reflectance  function,  that  is,  discontinuities  in 
the  surface  color.  The  algorithm  relies  on  the  idea  of  effective  illumination  and  on  the  single  source 
assumption,  both  introduced  by  Hurlbert  and  Poggio  (see  Poggio  et  al.,  1985). 

The  single  source  assumption  states  that  the  illumination  may  be  separated  into  two  components, 
one  dependent  only  on  wavelength,  and  one  dependent  only  on  spatial  coordinates;  this  generally 
holds  for  illumination  from  a  single  light  source.  It  allows  us  to  write  the  image  irradiance  equation 
for  a  Lambertian  world  as 


$$I^{\nu}(x,y) = k^{\nu} E(x,y) \rho^{\nu}(x,y)$$


where $I^{\nu}$ is the image irradiance in the νth spectral channel (ν = red, green, blue), $\rho^{\nu}(x,y)$ is the
surface spectral reflectance (or albedo), and the effective illumination E(x,y) absorbs the spatial
variations of the illumination and the shading due to the 3D shape of surfaces ($k^{\nu}$ is a constant
for each channel, and depends only on the illuminant). A simple segmentation algorithm is then
obtained by considering the equation


$$H(x,y) = \frac{I^{r}}{I^{r} + I^{g}} = \frac{k^{r}\rho^{r}}{k^{r}\rho^{r} + k^{g}\rho^{g}}$$

which changes only when $\rho^{r}$, or $\rho^{g}$, or both change. Thus H, which is piecewise constant, has
discontinuities  that  mark  changes  in  the  surface  albedo,  independently  of  changes  in  the  effective 
illumination. 
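
The cancellation of the effective illumination can be seen in a small Python sketch. It assumes that H is the ratio of the red channel to the sum of the red and green channels; the function names and the choice to return `None` where the ratio is undefined are hypothetical.

```python
def hue_ratio(i_red, i_green):
    """Ratio I_r / (I_r + I_g) at one pixel, or None where it is undefined."""
    denominator = i_red + i_green
    if denominator == 0:
        return None  # no signal in either channel, so the ratio is undefined
    return i_red / denominator


def hue_image(red, green):
    """Apply hue_ratio pixel by pixel to two 2D lists of equal size."""
    return [
        [hue_ratio(r, g) for r, g in zip(red_row, green_row)]
        for red_row, green_row in zip(red, green)
    ]
```

With irradiances of the form k * E * rho, the factor E appears in both numerator and denominator, so two pixels with the same albedos but different effective illumination yield the same value of H.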

The quantity H(x,y) is defined almost everywhere, but is typically noisy. To counter the effect of
noise, we exploit the prior information that H should be piecewise constant with discontinuities
that are themselves continuous, non-intersecting lines. As we will discuss later, this restoration step
is achieved by using an MRF model. This algorithm works only under the restrictive assumption
that  specular  reflections  can  be  neglected.  Hurlbert  (1989)  discusses  in  more  detail  the  scheme 
outlined  here  and  how  it  can  be  extended  to  more  general  conditions. 


2.4.5 Texture


The  texture  algorithm  is  a  greatly  simplified  parallel  version  of  the  texture  algorithm  developed 
by  Voorhees  and  Poggio  (1987).  Texture  is  a  scalar  measure  computed  by  summation  of  texton 
densities over small regions surrounding every point. Discontinuities in this measure can correspond
to occlusion boundaries, or orientation discontinuities, which cause foreshortening. Textons
are  computed  in  the  image  by  simple  approximation  to  the  methods  presented  in  Voorhees  and 
Poggio  (1987).  For  this  example,  the  textons  are  restricted  to  blob-like  regions,  without  regard  to 
orientation  selection. 


To  compute  textons,  the  image  is  first  filtered  by  a  Laplacian  of  Gaussian  filter  at  several  different 
scales.  The  smallest  scale  selects  the  textural  elements.  The  Laplacian  of  Gaussian  image  is  then 
thresholded  at  a  non-zero  value  to  find  the  regions  which  comprise  the  blobs  identified  by  the 
textons.  The  result  is  a  binary  image  with  non-zero  values  only  in  the  areas  of  the  blobs.  A  simple 
summation  counts  the  density  of  blobs  (the  portion  of  the  summation  region  covered  by  blobs)  in 
a  small  area  surrounding  each  point.  This  operation  effectively  measures  the  density  of  blobs  at 
the  small  scale,  while  also  counting  the  presence  of  blobs  caused  by  large  occlusion  edges  at  the 
boundaries  of  textured  regions.  Contrast  boundaries  appear  as  blobs  in  the  Laplacian  of  Gaussian 
image.  To  remove  their  effect,  we  use  the  Laplacian  of  Gaussian  image  at  a  slightly  coarser  scale. 
Blobs  caused  by  the  texture  at  the  fine  scale  do  not  appear  at  this  coarser  scale,  while  the  contrast 
boundaries,  as  well  as  all  other  blobs  at  coarser  scales,  remain.  This  coarse  blob  image  filters  the  fine 
blobs;  blobs  at  the  coarser  scale  are  removed  from  the  fine  scale  image.  Then,  summation,  whether 
with a simple scan operation, or Gaussian filtering, can determine the blob density at the fine scale
only. This is one example where multiple spatial scales are used in the present implementation of
the  Vision  Machine. 
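
The thresholding, coarse-scale masking, and summation steps can be sketched in Python as follows. This is a minimal illustration, not the implemented module: it assumes the Laplacian of Gaussian responses at the fine and coarse scales have already been computed elsewhere and are passed in as 2D lists, that blobs are the pixels whose response exceeds a single positive threshold, and that the summation is a plain box sum. The name `blob_density` and its parameters are hypothetical.

```python
def blob_density(log_fine, log_coarse, threshold, radius=1):
    """Fraction of each window covered by fine-scale blobs absent at the coarse scale.

    log_fine, log_coarse: precomputed filter responses (2D lists, same size).
    threshold: non-zero value above which a pixel counts as part of a blob.
    radius: the summation window is the (2 * radius + 1)-wide square.
    """
    height, width = len(log_fine), len(log_fine[0])

    # Binary image of fine-scale blobs, with coarse-scale blobs removed.
    kept = [
        [
            1 if log_fine[y][x] > threshold and log_coarse[y][x] <= threshold else 0
            for x in range(width)
        ]
        for y in range(height)
    ]

    density = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            rows = range(max(0, y - radius), min(height, y + radius + 1))
            cols = range(max(0, x - radius), min(width, x + radius + 1))
            covered = sum(kept[py][px] for py in rows for px in cols)
            # Normalize by the window area (clipped at the image border).
            density[y][x] = covered / (len(rows) * len(cols))
    return density
```

Discontinuities in the returned density map are then the candidate texture boundaries.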


2.4.6  The  Integration  Stage  and  MRF 

Whereas  it  is  reasonable  that  combining  the  evidence  provided  by  multiple  cues,  for  example,  edge 
detection,  stereo,  and  color,  should  provide  a  more  reliable  map  of  the  surfaces  than  any  single  cue 
alone,  it  is  not  obvious  how  this  integration  can  be  accomplished.  The  various  physical  processes 
that  contribute  to  image  formation,  surface  depth,  surface  orientation,  albedo  (Lambertian  and 
specular component), illumination, are coupled to the image data, and therefore to each other,
through  the  imaging  equation.  The  coupling  is,  however,  difficult  to  exploit  in  a  robust  way,  since 
it  depends  critically  on  the  reflectance  and  imaging  models.  We  argue  that  the  coupling  of  the  image 
data  to  the  surface  and  illumination  properties  is  of  a  more  qualitative  and  robust  sort  at  locations 
in  which  image  brightness  changes  sharply  and  surface  properties  are  discontinuous,  in  short,  at 
edges.  The  intuitive  reason  for  this  is  that  at  discontinuities,  the  coupling  between  different  physical 
processes  and  the  image  data  is  robust  and  qualitative.  For  instance,  a  depth  discontinuity  usually 
originates  a  brightness  edge  in  the  image,  and  a  motion  boundary  often  corresponds  to  a  depth 
discontinuity  (and  a  brightness  edge)  in  the  image.  This  view  suggests  the  following  integration 
scheme  for  restoring  the  data  provided  by  early  modules.  The  results  provided  by  stereo,  motion, 
and  other  visual  cues  are  typically  noisy  and  sparse.  We  can  improve  them  by  exploiting  the  fact 
that  they  should  be  smooth,  or  even  piecewise  constant  (as  in  the  case  of  the  albedo),  between 
discontinuities.  We  can  exploit  a  priori  information  about  generic  properties  of  the  discontinuities 
themselves,  for  instance,  that  they  usually  are  continuous  and  non-intersecting. 

The  idea  is  then  to  detect  discontinuities  in  each  cue,  for  instance  depth,  simultaneously  with  the 
approximation  of  the  depth  data.  The  detection  of  discontinuities  is  helped  by  information  on  the 
presence  and  type  of  discontinuities  in  the  surfaces  and  surface  properties  (see  Figure  1),  which  are 
coupled  to  the  brightness  edges  in  the  image. 

Notice  that  reliable  detection  of  discontinuities  is  critical  for  a  vision  system,  since  discontinuities 
are  often  the  most  important  locations  in  a  scene;  depth  discontinuities,  for  example,  normally 
correspond  to  the  boundaries  of  an  object  or  an  object  part.  The  idea  is  thus  to  couple  different 
cues  through  their  discontinuities  and  to  use  information  from  several  cues  simultaneously  to  help 
refine the initial estimation of discontinuities, which are typically noisy and sparse.

How  can  this  be  done?  We  have  chosen  to  use  the  machinery  of  Markov  Random  Fields  (MRFs), 
initially  suggested  for  image  processing  by  Geman  and  Geman  (1984).  In  the  following  section, 
we will give a brief, informal outline of the technique and of our integration scheme. More detailed
information about MRFs can be found in Geman and Geman (1984) and Marroquin et al.
(1987). Gamble and Poggio (1987) describe an earlier version of our integration scheme and its
implementation  as  outlined  in  the  next  section. 

MRF  Models 

Consider  the  prototypical  problem  of  approximating  a  surface  given  sparse  and  noisy  data  (depth 
data)  on  a  regular  2D  lattice  of  sites.  We  first  define  the  prior  probability  of  the  class  of  surfaces 
we  are  interested  in.  The  probability  of  a  certain  depth  at  any  given  site  in  the  lattice  depends 
only  upon  neighboring  sites  (the  Markov  property).  Because  of  the  Clifford-Hammersley  theorem, 
the  prior  probability  is  guaranteed  to  have  the  Gibbs  form 


$$P(f) = \frac{1}{Z} e^{-U(f)/T}$$

where Z is a normalization constant, T is called temperature, and $U(f) = \sum_c U_c(f)$ is an energy
function that can be computed as the sum of local contributions from each neighborhood. The
sum of the potentials, $U_c(f)$, is over the neighborhood’s cliques. A clique is either a single lattice
site  or  a  set  of  lattice  sites  such  that  any  two  sites  belonging  to  it  are  neighbors  of  one  another. 
Thus  U(f)  can  be  considered  as  the  sum  over  the  possible  configurations  of  each  neighborhood  (see 
Marroquin et al., 1987). As a simple example, when the surfaces are expected to be smooth, the
prior  probability  can  be  given  as  sums  of  terms  such  as 


$$U_c(f) = (f_i - f_j)^2$$

where  i  and  j  are  neighboring  sites  (belonging  to  the  same  clique). 

If  a  model  of  the  observation  process  is  available  (i.e.,  a  model  of  the  noise),  then  one  can  write  the 
conditional probability P(g/f) of the sparse observation g for any given surface f. Bayes' theorem
then allows one to write the posterior distribution

$$P(f/g) = \frac{1}{Z} e^{-U(f/g)/T}$$

In the simple earlier example, we have (for Gaussian noise)

$$U(f/g) = \sum_c \left[ \gamma_i (f_i - g_i)^2 + (f_i - f_j)^2 \right]$$

where $\gamma_i = 1$ only where data are available. More complicated cases can be handled in a similar
manner. 

The  posterior  distribution  cannot  be  solved  analytically,  but  sample  distributions  can  be  obtained 
using  Monte  Carlo  techniques  such  as  the  Metropolis  algorithm.  These  algorithms  sample  the 
space of possible surfaces according to the probability distribution P(f/g) that is determined by
the prior knowledge of the allowed class of surfaces, the model of noise, and the observed data. In
our  implementation,  a  highly  parallel  computer  generates  a  sequence  of  surfaces  from  which,  for 
instance,  the  surface  corresponding  to  the  maximum  of  P(f/g)  can  be  found.  This  corresponds 
to  finding  the  global  minimum  of  U(f/g)  (simulated  annealing  is  one  of  the  possible  techniques). 
Other criteria can be used: Marroquin (1985) has shown that the average surface f under the
posterior distribution is often a better estimate, and one which can be obtained more efficiently by
simply finding the average value of f at each lattice site.
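
A serial Metropolis sampler that estimates the average surface site by site might look like the Python sketch below. It is an illustration under stated assumptions rather than our parallel implementation: the lattice is one-dimensional, the energy is the simple quadratic example (a data term where observations exist plus squared differences between neighbors), missing observations are marked with `None`, and the function names, temperature, proposal step, and sweep counts are hypothetical.

```python
import math
import random


def local_energy(f, g, i, value):
    """Terms of the posterior energy that involve site i when f[i] == value."""
    energy = 0.0
    if g[i] is not None:  # the data term is present only where data exist
        energy += (value - g[i]) ** 2
    if i > 0:
        energy += (value - f[i - 1]) ** 2
    if i < len(f) - 1:
        energy += (value - f[i + 1]) ** 2
    return energy


def metropolis_mean(g, temperature=0.05, step=0.5, sweeps=3000, burn_in=500, seed=0):
    """Estimate the posterior mean surface by averaging Metropolis samples."""
    rng = random.Random(seed)
    known = [v for v in g if v is not None]
    start = sum(known) / len(known) if known else 0.0
    f = [v if v is not None else start for v in g]

    mean = [0.0] * len(g)
    kept = 0
    for sweep in range(sweeps):
        for i in range(len(f)):
            proposal = f[i] + rng.uniform(-step, step)
            delta = local_energy(f, g, i, proposal) - local_energy(f, g, i, f[i])
            # Accept downhill moves always, uphill moves with probability exp(-delta / T).
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                f[i] = proposal
        if sweep >= burn_in:
            kept += 1
            for i in range(len(f)):
                mean[i] += f[i]
    return [m / kept for m in mean]
```

For observations `[0.0, None, 3.0]` the energy is quadratic, so the posterior mean coincides with its minimum, (0.75, 1.5, 2.25), and the sampled averages should lie close to those values.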

One  of  the  main  attractions  of  MRFs  is  that  the  prior  probability  distribution  can  be  made  to 
embed  more  sophisticated  assumptions  about  the  world.  Geman  and  Geman  (1984)  introduced  the 
idea  of  another  process,  the  line  process,  located  on  the  dual  lattice,  and  representing  explicitly  the 
presence  or  absence  of  discontinuities  that  break  the  smoothness  assumption.  The  associated  prior 
energy  then  becomes 


$$U_c(f) = (f_i - f_j)^2 (1 - l_{ij}) + \beta V_c(l_{ij})$$

where l is a binary line element between sites i and j. $V_c$ is a term that reflects the fact that certain
configurations of the line process are more likely than others to occur. In our world, depth discontinuities
are usually themselves continuous, non-intersecting, and rarely isolated joints. These
properties of physical discontinuities can be enforced locally by defining an appropriate set of energy
values $V_c(l)$ for different configurations of the line process in the neighborhood of the site
(notice  that  the  assignment  of  zero  energy  values  to  the  non-central  cliques  mentioned  in  Gamble 
and  Poggio  (1987)  is  wrong,  as  pointed  out  to  us  by  Tal  Symchony). 
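
The effect of the line process on the prior energy can be illustrated on a one-dimensional chain of sites. The Python sketch below makes a deliberately crude assumption: every active line element pays the same cost `beta`, whereas the scheme described above assigns different energies to different local line configurations. The function names are hypothetical.

```python
def prior_energy(f, lines, beta):
    """Prior energy of a 1D surface f with binary line elements between neighbors.

    lines[i] is 1 if a discontinuity is placed between f[i] and f[i + 1], else 0.
    An active line switches off the smoothness term and pays the cost beta.
    """
    if len(lines) != len(f) - 1:
        raise ValueError("need exactly one line element per pair of neighbors")
    energy = 0.0
    for i, line in enumerate(lines):
        energy += (f[i] - f[i + 1]) ** 2 * (1 - line) + beta * line
    return energy


def best_lines(f, beta):
    """Line configuration minimizing prior_energy for a fixed surface f.

    With a constant per-line cost the choice decouples: a line is worth
    placing exactly where the squared jump exceeds beta.
    """
    return [1 if (f[i] - f[i + 1]) ** 2 > beta else 0 for i in range(len(f) - 1)]
```

For the step `[0, 0, 5, 5]` with `beta = 1`, the energy without lines is 25, while placing a single line at the step reduces it to 1.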

Organization  of  Integration 

It  is  possible  to  extend  the  energy  function  to  accommodate  the  interaction  of  more  processes  and 
their  discontinuities.  In  particular,  we  have  extended  the  energy  function  to  couple  several  of  the 
early  vision  modules  (depth,  motion,  texture,  and  color)  to  brightness  edges  in  the  image.  This  is  a 
central  point  in  our  integration  scheme;  brightness  edges  guide  the  computation  of  discontinuities  in 
the  physical  properties  of  the  surface,  thereby  coupling  surface  depth,  surface  orientation,  motion, 
texture,  and  color,  each  to  the  image  brightness  data  and  to  each  other.  The  reason  for  the  role 
of  brightness  edges  is  that  changes  in  surface  properties  usually  produce  large  brightness  gradients 
in  the  image.  It  is  exactly  for  this  reason  that  edge  detection  is  so  important  in  both  artificial  and 
biological  vision. 

The coupling to brightness edges may be done by replacing the term $V_c(l_{ij})$ in the last equation
with the term


$$V(l,e) = g(e_{ij}, V_c(l_{ij}))$$


with $e_{ij}$ representing a measure of the presence of a brightness edge between sites i and j. The term
g has the effect of modifying the probability of the line process configuration depending on the
brightness edge data (V(l,e) = -log p(l/e)). This term facilitates formation of discontinuities
(that is, $l_{ij}$) at the locations of brightness edges. Ideally, the brightness edges (and the neighboring
image  properties)  activate,  with  different  probabilities,  the  different  surface  discontinuities  (see 
Figure  1),  which  in  turn  are  coupled  to  the  output  of  stereo,  motion,  color,  texture,  and  possibly 
other  early  algorithms. 

We  have  been  using  the  MRF  machinery  with  prior  energies  like  that  shown  above  (see  also  Figure 
1)  to  integrate  edge  brightness  data  with  stereo,  motion,  and  texture  information  on  the  MIT  Vision 
Machine  System. 

We should emphasize that our present implementation represents a subset of the possible interactions
shown in Figure 1, itself only a simplified version of the organization of the likely integration
process. The system will be improved in an incremental fashion, including pathways not shown in
Figure 1, such as feedback from the results of integration into the matching stage of the stereo and
motion  algorithms. 

Algorithms:  Deterministic  and  Stochastic 

We  have  chosen  to  use  MRF  models  because  of  their  generality  and  theoretical  attractiveness.  This 
does  not  imply  that  stochastic  algorithms  must  be  used.  For  instance,  in  the  cases  in  which  the 
MRF  model  reduces  to  standard  regularization  (Marroquin  et  al.,  1987)  and  the  data  are  given 
on  a  regular  grid,  the  MRF  formulation  leads  not  only  to  a  purely  deterministic  algorithm,  but 
also  to  a  convolution  filter.  Recent  work  in  color  (Hurlbert  and  Poggio,  1989)  shows  that  one  can 
perform  integration  similar  to  the  MRF-based  scheme  using  a  deterministic  update.  Geiger  and 
Girosi (1989) have shown that there is a class of deterministic schemes that are the mean-field
approximations of the MRF models. These schemes have a much higher speed than the Monte Carlo
schemes we have used so far, while promising similar performance.
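
As a simple illustration of a deterministic alternative, the quadratic energy without line process (the standard regularization case) can be minimized by iterative relaxation. The Python sketch below uses Gauss-Seidel updates on a one-dimensional lattice; it is neither the convolution filter nor the mean-field scheme mentioned above, and the function name, the `None` marker for missing data, and the iteration count are hypothetical.

```python
def relax_surface(g, iterations=200):
    """Minimize sum_i gamma_i (f_i - g_i)^2 + sum_i (f_i - f_{i+1})^2 by relaxation.

    g: list of observations, with None where no data are available (gamma_i = 0).
    Each update sets f[i] to the value that zeroes the derivative of the
    energy with respect to f[i], holding its neighbors fixed.
    """
    n = len(g)
    f = [v if v is not None else 0.0 for v in g]
    for _ in range(iterations):
        for i in range(n):
            total, weight = 0.0, 0.0
            if g[i] is not None:
                total += g[i]
                weight += 1.0
            if i > 0:
                total += f[i - 1]
                weight += 1.0
            if i < n - 1:
                total += f[i + 1]
                weight += 1.0
            if weight > 0:
                f[i] = total / weight
    return f
```

For observations `[0.0, None, 3.0]` the iteration converges to (0.75, 1.5, 2.25), the same surface that a sampler would approach only on average.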


2.5  Illustrative  Results 

Figures 2 and 3 show the results of the Vision Machine applied to the scene in Figure 2 and some
of the intermediate steps. Figure 3 shows the brightness edges computed by the Canny algorithm
at two different spatial scales (σ = 2.5 and σ = 4). We show neither the stereo pair nor the motion
sequence in which the teddy bear was rolling slightly on his back from one frame to the next.
The results given by the stereo, motion, texture and color algorithms, after an initial smoothing
to make them dense (see Gamble and Poggio, 1987), are shown in the first column on the left of
Figure 4 (from top to bottom). They represent the input to the MRF machinery that integrates
each of those data sets with the brightness edges. The color algorithm uses the edges at the
coarser resolution, since we want to avoid detecting texture marks on the surface; the other cues
are integrated with the Canny edges at a smaller scale (σ = 2.5). The central column of Figure 4
shows the reconstructed depth, color (the quantity H defined earlier), texture and motion flow; the
right column shows the discontinuities found by the MRF machinery in each of the cues. Processing
of the stereo output finds depth discontinuities in the scene (mainly the outlines of the teddy, plus
a fold of a wet suit protruding outward). Motion discontinuities are found by the MRF machinery
with help from brightness edges. The color boundaries show regions of constant surface color,
independently of its shading: notice, for instance, that brightness edges inside the teddy bear, due
to shading, do not appear as color edges (the color images were taken from a different camera).
The texture boundaries correspond quite well to different textured surfaces.

Figure 2: Grey-level image of a natural scene processed by the Vision Machine.

Figure 3: Canny edges of the images in Figure 2.

Figure 4: MRF results for stereo, motion, texture, and color.

Figure 5: Union of depth and motion discontinuities.

Figure  5  shows  that  the  union  of  the  discontinuities  in  depth  and  motion  for  the  scene  of  Figure  2 
gives a rather good “cartoon” of the original scene. At the same time, our integration algorithm
achieves  a  preliminary  classification  of  the  brightness  edges  in  the  image,  in  terms  of  their  physical 
origin.  A  more  complete  classification  will  be  achieved  by  the  full  scheme  in  :  the  lattices  at  the 
top  classify  the  different  types  of  discontinuities  in  the  scene.  The  set  of  such  discontinuities  in  the 
various  physical  processes  should  represent  a  good  set  of  data  for  later  recognition  stages. 


2.6  Recognition 

The  output  of  the  integration  stage  provides  a  set  of  edges  labeled  in  terms  of  physical  discontinuities 
of the surface properties. They represent a good input to a model-based recognition algorithm like
the  ones  described  by  Dan  Huttenlocher  and  Todd  Cass  in  the  1988  Proceedings  of  the  Image 
Understanding  Workshop.  In  particular,  we  have  interfaced  the  Vision  Machine  as  implemented  so 
far  with  the  Cass  algorithm.  We  have  used  only  discontinuities  for  recognition;  later  we  will  also 
use  the  information  provided  by  the  MRFs  on  the  surface  properties  between  discontinuities. 

We  have  more  ambitious  goals  for  the  recognition  stage  of  the  Vision  Machine.  In  an  unconstrained 
environment  the  library  of  models  that  a  system  with  human-level  performance  requires  is  in  the 
order  of  many  thousands.  Thus,  the  ability  to  learn  from  examples  appears  to  be  essential  for  the 
achievement of high performance in real-world recognition tasks. Learning the models becomes
then  a  primary  concern  in  developing  a  recognition  system  for  the  Vision  Machine.  This  has  not 
been  the  case  in  other  approaches  of  the  last  few  years,  mainly  motivated  by  a  robotic  framework. 

2.6.1 Learning in a three-stage recognition scheme

Although  some  of  the  existing  recognition  systems  incorporate  a  module  for  learning  object  models 
from examples (e.g., Tucker’s 2D system [67]), no such capability exists yet for the more difficult
problems of recognizing 3D objects [37] or handwriting [16]. We believe that incorporating learning
into a general-purpose recognition system may be facilitated by breaking down the task of
recognition into three distinct but interacting stages: selection, indexing and verification.

Selection 

Selection  or  segmentation  breaks  down  the  image  into  regions  that  are  likely  to  correspond  to 
single  objects.  The  utility  of  an  early  segmentation  of  a  scene  into  meaningful  entities  lies  in  the 
great  reduction  of  complexity  of  scene  interpretation.  Each  of  the  detected  objects  can  in  turn  be 
subjected  to  separate  recognition,  by  comparing  it  with  object  models  stored  in  memory.  Without 
prior  segmentation,  every  possible  combination  of  image  primitives  such  as  lines  and  blobs  can  in 
principle  constitute  an  object  and  must  be  checked  out.  The  power  of  early  segmentation  may 
be  enhanced  by  integrating  all  available  visual  cues,  especially  if  the  integration  parameters  are 
automatically  adjusted  to  suit  the  particular  scene  in  question. 

Indexing 

By  indexing  we  mean  defining  a  small  set  of  candidate  objects  that  are  likely  to  be  present  in 
the  image.  Although  one  cannot  hope  to  achieve  an  ideal  segmentation  in  real-world  situations, 
partial  success  is  sufficient  if  the  indexing  process  is  robust.  Assuming  that  most  objects  in  the 
real  world  are  redundantly  specified  by  their  local  features,  a  good  indexing  mechanism  would  use 
such  features  to  overcome  changes  in  viewpoint  and  illumination,  occlusion  and  noise. 

What  kind  of  feature  is  good  for  indexing?  Reliably  detected  lines  provided  by  the  integration  of 
several  low-level  cues  in  the  process  of  segmentation  may  suffice  in  many  cases.  We  conjecture  that 
simple  viewpoint-invariant  combinations  of  primitive  elements,  such  as  two  lines  forming  a  corner, 
parallel  lines  and  symmetry  are  also  likely  to  be  useful.  Ideally,  only  2D  information  should  be 
used  for  indexing,  although  it  may  be  augmented  sometimes  by  qualitative  3D  cues  such  as  relative 
depth. 

Verification 

In  the  verification  stage  each  of  the  candidates  screened  by  the  indexing  process  is  tested  to  find  the 
best  match  to  the  image.  At  this  stage,  the  system  can  afford  to  perform  complicated  tests,  since  the 
number  of  candidate  objects  is  small.  We  conjecture  that  hierarchical  indexing  by  a  small  number 
(two  or  three)  features  that  are  spatially  localized  in  2D  suffices  to  achieve  useful  interpretations 
of  most  everyday  scenes.  In  general,  however,  further  verification  by  task-dependent  routines  [68] 
or  precise  shape  matching,  possibly  involving  3D  information,  is  required  [69]  [47]  [37][67]  [7]  [1]. 

2.7  Future  Developments 

The Vision Machine should evolve in several parallel directions:

•  improvement and extensions of its early modules

•  improvement  of  the  integration  and  recognition  stages  (recognition  is  discussed  later) 

•  use of the eye-head system in an active mode during recognition tasks by developing
appropriate  gaze  strategies 

•  use  of  the  results  of  the  integration  stage  in  order  to  improve  the  operation  of  early  modules 
such  as  stereo  and  motion  by  feeding  back  the  preliminary  computation  of  the  discontinuities 

Two goals will occupy most of our attention, if we are able to continue to work on the project.
The first one is the development of the overall organization of the Vision Machine. The system can
be seen as an implementation of the inverse optics paradigm: it attempts to extract surface properties
from the integration of image cues. It must be stressed that we never intended this framework
to imply that precise surface properties, such as dense, high resolution depth maps, must be delivered
by the system. This extreme interpretation of inverse optics seems to be common, but was
not  the  motivation  of  our  project,  which  originally  started  with  the  name  Coarse  Vision  Machine 
to  emphasize  the  importance  of  computing  qualitative,  as  opposed  to  very  precise,  properties  of 
the  environment. 

Our second main goal in the Vision Machine project will be machine learning, which we will discuss in
the next chapter. In particular, we have begun to explore simple learning and estimation techniques
for vision tasks. We have succeeded in synthesizing a color algorithm from examples [36] and in
developing  a  technique  to  perform  unsupervised  learning  [63]  of  other  simple  vision  algorithms 
such  as  simple  versions  of  the  computation  of  texture  and  stereo.  In  addition,  we  have  used 
learning  techniques  to  perform  integration  tasks,  such  as  labeling  the  type  of  discontinuities  in 
a  scene.  We  have  also  begun  to  explore  the  connections  between  recent  approaches  to  learning, 
such  as  neural  networks,  genetic  algorithms,  and  classical  methods  in  approximation  theory  such 
as  splines,  Bayesian  techniques  and  Markov  Random  Field  models,  as  discussed  in  one  of  the  next 
chapters.  We  have  identified  some  common  properties  of  all  these  approaches  and  some  of  the 
common  limitations,  such  as  sample  complexity.  As  a  consequence,  we  now  believe  that  we  can 
leverage  our  expertise  in  approximation  techniques  for  the  problem  of  learning  in  machine  vision. 

For further details and background information on this work, see the following references: [75, 60,
3, 46, 39, 32, 74, 43, 71, 8, 49, 10, 41, 64, 65, 40, 59, 58, 70].


3 VLSI


3.0.1  A  VLSI  Vision  Machine? 

Our  Vision  Machine  consists  mostly  of  specialized  software  running  on  a  general  purpose  computer, 
the  Connection  Machine.  This  is  a  good  system  for  the  present  stage  of  experimentation  and 
development. Later, once we have perfected and tested the algorithms and the overall system,
it  will  make  sense  to  compile  the  software  in  silicon  in  order  to  produce  a  faster,  cheaper,  and 
smaller  Vision  Machine.  We  are  presently  planning  to  use  VLSI  technologies  to  develop  some 
initial  chips  as  a  first  step  toward  this  goal.  In  this  section,  we  will  outline  some  thoughts  about 
VLSI  implementation  of  the  Vision  Machine. 

Algorithms  and  Hardware 

We realize that our specialized software vision algorithms are not, in general, optimized for hardware
implementation. So, rather than directly “hardwiring algorithms” into standard computing
circuitry,  we  will  be  investigating  “algorithmic  hardware”  designs  that  utilize  the  local,  symmetric 
nature  of  early  vision  problems.  This  will  be  an  iterative  process,  as  the  algorithm  influences  the 
hardware  design  and  as  hardware  constraints  modify  the  algorithm. 

Degree  of  Parallelism 

Typical  vision  tasks  require  tremendous  amounts  of  computing  power,  and  are  usually  parallel  in 
nature.  As  an  example,  biological  vision  uses  highly  parallel  networks  of  relatively  slow  components 
to  achieve  sophisticated  systems.  However,  when  implementing  our  algorithms  in  silicon  integrated 
circuits,  it  is  not  clear  what  level  of  parallelism  is  necessary.  While  biology  is  able  to  use  three 
dimensions to construct highly interconnected parallel networks, VLSI is limited to 2½ dimensions,
making highly parallel networks much more difficult and costly to implement. However, the electrical
components of silicon integrated circuits are approximately four orders of magnitude faster
than the electrochemical components of biology. This suggests that pipelined processing or other
methods of time-sharing computing power may be able to compensate for the lower degree of connectivity
of silicon VLSI. Clearly, the architecture of a VLSI vision system may not resemble any
biological vision system.

Signal  Representation 

Within  the  integrated  circuit,  the  image  data  may  be  represented  as  a  digital  word  or  an  analog 
value.  While  the  advantages  of  digital  computation  are  its  accuracy  and  speed,  digital  circuits  do 
not  have  as  high  a  degree  of  functionality  per  device  as  analog  circuits.  Therefore,  analog  circuits 
should  allow  much  denser  computing  networks.  This  is  particularly  important  for  the  integration  of 
computational  circuitry  and  photosensors,  which  will  help  to  alleviate  the  I/O  bottleneck  typically 
experienced  whenever  image  data  are  serially  transferred  between  Vision  Machine  components. 
However, analog circuits are limited in accuracy, and are difficult to characterize and design.

The primary motivation for a VLSI implementation of our Vision Machine is to increase the computational
speed and reduce the physical size of the components, with the eventual goal of real-time,
mobile vision systems. While the main computational engine of our Vision Machine is the Connection
Machine, which is a very powerful and flexible SIMD computer, specific VLSI implementations
will attempt to trade off computational flexibility for faster performance and a higher degree of
integration. A VLSI implementation of our Vision Machine can offer significant improvements in
performance that would be difficult or impossible to attain by other methods. Presently, we are
specifically investigating the integration of charge-coupled devices for photosensing and simple parallel
computations, such as binomial convolution and patchwise correlation. In particular, Woody
Yang has developed and fabricated CCD circuits for signal processing and imaging, and has described some
basic operations and how those operations can be combined into a CCD processor architecture for
vision. A circuit for performing Laplacian-of-Gaussian filtering of the image has been sent to fabrication.
The paper discusses other CCD circuits for the integration-reconstruction stage of the
Vision Machine and for stereo.
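
As a software illustration of the binomial convolution mentioned above, the following Python sketch builds a binomial kernel by repeated convolution with [1/2, 1/2] and applies it separably to an image. It is not a model of the CCD circuit; the function names `binomial_kernel` and `binomial_smooth`, the kernel order, and the zero-padded borders are assumptions made for the example.

```python
import numpy as np

def binomial_kernel(order):
    """1-D binomial kernel obtained by convolving [1/2, 1/2] with itself `order` times."""
    k = np.array([1.0])
    for _ in range(order):
        k = np.convolve(k, [0.5, 0.5])
    return k

def binomial_smooth(image, order=4):
    """Separable binomial smoothing: filter every row, then every column.

    Borders are implicitly zero-padded by mode="same" (an assumption of this sketch).
    """
    k = binomial_kernel(order)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, image)
    return np.apply_along_axis(lambda col: np.convolve(col, k, mode="same"), 0, rows)

# Order 4 gives the familiar [1, 4, 6, 4, 1] / 16 kernel.
kernel = binomial_kernel(4)
smoothed = binomial_smooth(np.ones((9, 9)), order=4)
```

Because each pass only adds neighboring samples with equal weights, the operation maps naturally onto local, parallel hardware, which is the property the text relies on.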


4  Learning 


Poggio and Girosi have recently obtained what we believe is a satisfactory understanding of the
learning obtained by “neural” networks such as backpropagation. In the last Proceedings we
had drawn a formal analogy between simple forms of learning and hypersurface reconstruction.
As a consequence, learning can be achieved by techniques such as regularization and therefore
generalized splines. The connection, however, between these classical methods and feedforward
networks of the backpropagation type remained unclear. Poggio and Girosi have now found that
the missing link is provided by the approximation method of Radial Basis Functions. The Radial
Basis Function approximation method has a sound theoretical basis and a direct interpretation in
terms of a feedforward network with one “hidden” layer. Poggio and Girosi have been able to prove
its connections to generalized splines, to regularization techniques and to Bayes’ approaches. They
have developed several new extensions of the method and indicated how to address a few general
issues in networks and learning within its formal framework (Girosi and Poggio, 1989, 1990).


We  describe  briefly  the  interpolation  and  approximation  technique  called  Radial  Basis  Functions, 
which  has  been  used  in  the  past  for  surface  interpolation  with  very  promising  results;  clearly  surface 
reconstruction  is  another  application  of  this  technique  of  interest  to  vision  research. 

4.1 Radial Basis Functions


Given a set $D = \{(x_i, y_i) \in R^n \times R \mid i = 1, \ldots, N\}$ of data to interpolate, the Radial Basis Function
method corresponds to choosing the form of the interpolating function as

$$F(x) = \sum_{i=1}^{N} c_i h(\|x - x_i\|^2)$$

where h is a smooth univariate function defined on $[0, \infty)$ and $\|\cdot\|$ is a norm on $R^n$. This formula
means that the interpolating function is expanded on a finite basis of N elements, given by
the function h translated and centered at the data points. The N unknown coefficients of the
expansion can be recovered by imposing the interpolation conditions $F(x_j) = y_j$. This gives the linear
system

$$y_j = \sum_{i=1}^{N} c_i h(\|x_j - x_i\|^2), \qquad j = 1, \ldots, N.$$

Defining the vectors Y, c and the symmetric matrix H as follows

$$(Y)_i = y_i, \qquad (c)_i = c_i, \qquad (H)_{ij} = h(\|x_j - x_i\|^2)$$

we obtain

$$c = H^{-1} Y$$

provided  H  is  invertible.  The  invertibility  of  H  depends  on  the  choice  of  the  function  h.  In  fact 
Micchelli  proved  the  following  theorem,  that  defines  a  class  of  functions  that  we  can  choose  to  form 
the  basis: 

Theorem 4.1.1 Let G be a continuous function on $[0, \infty)$ and positive on $(0, \infty)$. Suppose its
first derivative is completely monotonic but not constant on $(0, \infty)$. Then for any distinct vectors
$x_1, \ldots, x_N \in R^n$

$$(-1)^{N-1} \det G(\|x_i - x_j\|^2) > 0$$
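
The interpolation scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a Gaussian basis function (the choice used in Section 4.3), for which H is positive definite when the data points are distinct, and the width `sigma`, the helper names, and the toy data are all assumptions made for the example.

```python
import numpy as np

def gaussian(r2, sigma=0.5):
    """Gaussian basis function h evaluated on squared distances r2 (sigma is assumed)."""
    return np.exp(-r2 / (2.0 * sigma ** 2))

def sq_dists(a, b):
    """Matrix of squared Euclidean distances between the rows of a and the rows of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def rbf_fit(x, y, h=gaussian):
    """Solve H c = Y with H_ij = h(||x_i - x_j||^2)."""
    H = h(sq_dists(x, x))
    return np.linalg.solve(H, y)

def rbf_eval(x_new, x, c, h=gaussian):
    """Evaluate F(x_new) = sum_i c_i h(||x_new - x_i||^2)."""
    return h(sq_dists(x_new, x)) @ c

# Toy data: N = 8 distinct points in R^2 (illustrative values only).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(8, 2))
y = np.sin(3.0 * x[:, 0]) + x[:, 1] ** 2
c = rbf_fit(x, y)

# The interpolation conditions F(x_i) = y_i should hold up to rounding error.
residual = np.max(np.abs(rbf_eval(x, x, c) - y))
```

Solving the linear system directly, rather than forming the inverse of H, is the usual numerical choice and gives the same coefficients.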

The interpolation conditions can be weakened if the number of knots is made lower than the
number of data and their coordinates are allowed to be chosen arbitrarily. In this case, denoting
by $t_1, \ldots, t_K$ the coordinates of the K knots ($K < N$), the interpolation conditions give the linear
system $Y = Hc$, where $(H)_{i\alpha} = h(\|x_i - t_\alpha\|^2)$ ($i = 1, \ldots, N$ and $\alpha = 1, \ldots, K$). The matrix H being
rectangular ($N \times K$), this system is overconstrained and the problem must then be regularized
to obtain a reasonable set of coefficients for the expansion. A least-squares approach can then be
adopted and the optimal solution can be written as

$$c = H^{+} Y$$

where $H^{+}$ is the Moore-Penrose pseudo-inverse. In the overdetermined case, one has

$$H^{+} = (H^T H)^{-1} H^T.$$

As in the previous case this formulation makes sense if the matrix $H^T H$ is nonsingular. Micchelli's
theorem  is  still  relevant  to  this  problem,  since  Poggio  and  Girosi  proved  the  following  corollary: 

Theorem 4.1.2 Let G be a function satisfying the conditions of Micchelli's theorem and
$x_1, \ldots, x_N$ an N-tuple of vectors in $R^n$. If H is the $N \times (N - s)$ matrix obtained from the matrix
$G_{ij} = G(\|x_i - x_j\|^2)$ by deleting s arbitrary columns, then the $(N - s) \times (N - s)$ matrix $H^T H$ is not
singular.
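
The overdetermined case with fewer knots than data can be sketched as follows. The example assumes a Gaussian basis, evenly spaced knots, and noisy samples of a sine curve, none of which come from the report; it computes the coefficients both through the normal equations $(H^T H)^{-1} H^T Y$ and through NumPy's pseudo-inverse, which should agree when $H^T H$ is nonsingular.

```python
import numpy as np

def gaussian(r2, sigma=0.5):
    """Gaussian basis function on squared distances (sigma is an assumed width)."""
    return np.exp(-r2 / (2.0 * sigma ** 2))

def design_matrix(x, t, h=gaussian):
    """Rectangular matrix H_ia = h(||x_i - t_a||^2) of shape (N, K)."""
    diff = x[:, None, :] - t[None, :, :]
    return h(np.sum(diff ** 2, axis=-1))

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=(40, 1))                 # N = 40 data points
y = np.sin(3.0 * x[:, 0]) + 0.05 * rng.standard_normal(40)
t = np.linspace(-1.0, 1.0, 6)[:, None]                   # K = 6 knots (assumed placement)

H = design_matrix(x, t)
c_normal = np.linalg.solve(H.T @ H, H.T @ y)             # (H^T H)^{-1} H^T Y
c_pinv = np.linalg.pinv(H) @ y                           # Moore-Penrose pseudo-inverse

rms_error = np.sqrt(np.mean((H @ c_pinv - y) ** 2))      # least-squares fit quality
```

In practice the pseudo-inverse (or `np.linalg.lstsq`) is usually preferred to the normal equations, since forming $H^T H$ squares the condition number of the problem.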

The Radial Basis Function expansion can be interpreted as a feedforward network with three layers.
The first layer consists of “input” units whose number is equal to the number of independent
variables of the problem. The second layer implements the set of radial basis functions and its
number of units is equal to the number of knots. The units of the second layer are in general fully
connected to the units of the first one. The third layer consists of one unit (for a scalar function)
connected to all the units of the second layer and computing a weighted sum of their outputs. The
weights are the coefficients of the radial basis expansion and are the only unknowns of the problem.
Since spline interpolation can be implemented by such a network, and splines are known to have
a large power of approximation, we have then shown that a high degree of approximation can be
obtained by a network with just one hidden layer.

4.2  An  extension:  Generalized  Radial  Basis  Functions 

Poggio and Girosi noticed that the knots of the radial basis expansion have been kept fixed, the
weights being the only unknowns. To make the method more flexible they propose to consider the
knots as unknowns as well, and to look for the configuration of weights and knots that minimizes the
least-squares error on the data. The problem then consists in finding the values of the coefficients
$c_\alpha$ and knots $t_\alpha$ that minimize the function

$$E = \sum_{i=1}^{N} \left( y_i - \sum_{\alpha=1}^{K} c_\alpha h(\|x_i - t_\alpha\|^2) \right)^2.$$

A gradient-descent approach can be adopted to find the solution to this problem. The values of
$c_\alpha$ and $t_\alpha$ are then regarded as the coordinates of the stable fixed point of the following dynamical
system:

$$\dot{c}_\alpha = -\omega \frac{\partial E}{\partial c_\alpha}, \qquad \dot{t}_\alpha = -\omega \frac{\partial E}{\partial t_\alpha}, \qquad \alpha = 1, \ldots, K$$

where $\omega$ is a parameter determining the microscopic timescale of the problem and is related to the
rate of convergence to the fixed point. Defining the interpolation error as

$$\Delta_i = y_i - \sum_{\alpha=1}^{K} c_\alpha h(\|x_i - t_\alpha\|^2)$$

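we can write the gradient terms as

$$\frac{\partial E}{\partial c_\alpha} = -2 \sum_{i=1}^{N} \Delta_i h(\|x_i - t_\alpha\|^2)$$

$$\frac{\partial E}{\partial t_\alpha} = 4 c_\alpha \sum_{i=1}^{N} \Delta_i h'(\|x_i - t_\alpha\|^2)(x_i - t_\alpha)$$

where $h'$ is the first derivative of h. Equating $\partial E / \partial t_\alpha$ to zero we notice that at the fixed point the
knot vectors $t_\alpha$ satisfy the following equation:

$$t_\alpha = \frac{\sum_{i} P_i^\alpha x_i}{\sum_{i} P_i^\alpha}$$

where $P_i^\alpha = \Delta_i h'(\|x_i - t_\alpha\|^2)$. The optimal knots are then a weighted sum of the data points. The
weight $P_i^\alpha$ of the data point i for a given knot $\alpha$ is high if the interpolation error $\Delta_i$ is high there
and the radial basis function centered on that knot changes quickly in a neighborhood of the data point.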
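
The gradient terms above translate directly into code. The sketch below is only an illustration under stated assumptions: a Gaussian h with an assumed width, a discrete-time update with an assumed step size `omega` standing in for the continuous dynamical system, and synthetic one-dimensional data. It moves both the coefficients and the knots downhill on the least-squares error E.

```python
import numpy as np

def h(r2, sigma=0.5):
    """Gaussian basis function of the squared distance (sigma is assumed)."""
    return np.exp(-r2 / (2.0 * sigma ** 2))

def h_prime(r2, sigma=0.5):
    """Derivative of h with respect to its argument r2."""
    return -h(r2, sigma) / (2.0 * sigma ** 2)

def energy_and_grads(x, y, c, t):
    """Return E, dE/dc (shape K) and dE/dt (shape K x n) for weights c and knots t."""
    diff = x[:, None, :] - t[None, :, :]              # (N, K, n): x_i - t_a
    r2 = np.sum(diff ** 2, axis=-1)                   # (N, K)
    delta = y - h(r2) @ c                             # interpolation errors Delta_i
    E = np.sum(delta ** 2)
    dE_dc = -2.0 * h(r2).T @ delta
    w = delta[:, None] * h_prime(r2)                  # Delta_i * h'(||x_i - t_a||^2)
    dE_dt = 4.0 * c[:, None] * np.einsum("ik,ikn->kn", w, diff)
    return E, dE_dc, dE_dt

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=(30, 1))
y = np.sin(3.0 * x[:, 0])
t = np.linspace(-0.8, 0.8, 4)[:, None]                # initial knots (assumed)
c = np.zeros(4)                                       # initial weights (assumed)
omega = 0.005                                         # step size (assumed)

E_initial, _, _ = energy_and_grads(x, y, c, t)
for _ in range(3000):
    _, grad_c, grad_t = energy_and_grads(x, y, c, t)
    c -= omega * grad_c
    t -= omega * grad_t
E_final, _, _ = energy_and_grads(x, y, c, t)
```

Like any gradient descent on a nonconvex function, this procedure can stop at a local minimum, so the final knots depend on their initial placement.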


4.3  RBF  are  equivalent  to  regularization 

Interesting connections between RBF and regularization techniques arise when the basis functions
are chosen to be Gaussian. Let us consider the RBF method in its original formulation, having
chosen the basis function to be a Gaussian G. The coefficients of the expansion are the solution of
the linear system $Y = Gc$ where $(G)_{ij} = G(\|x_i - x_j\|^2)$. If data are noisy a well known technique
[t>5]  to  regularize  the  solution  is  to  substitute  the  previous  linear  system  with  the  following 


$$Y = (G + \lambda I) c$$

where $\lambda$ is a small parameter and I is the identity matrix. We now show that the same approximating
function can be obtained from a pure regularization approach. Let us consider the following
functional


$$H[F] = \sum_{i=1}^{N} (y_i - F(x_i))^2 + \lambda \int dx \sum_{m=0}^{\infty} a_m (D^m F(x))^2$$

where $\lambda$ is a parameter, $D^{2m} = \nabla^{2m}$, $D^{2m+1} = \nabla \nabla^{2m}$, $\nabla^2$ is the Laplacian operator and the
coefficients $a_m$ are to be chosen. It can be easily proved that by setting $a_m = \sigma^{2m} / (m! \, 2^m)$ the function
that minimizes this functional can be written as

$$F(x) = \sum_{i=1}^{N} c_i G(\|x - x_i\|^2) \qquad (1)$$

where G is a Gaussian of variance $\sigma^2$ and the coefficients satisfy the linear system $Y = (G + \lambda I)c$,
that is, the same as before. So in this case RBF and regularization are equivalent. Notice that
changing the coefficients $a_m$ is equivalent to selecting another basis function h instead of G. In fact
it can be shown that the set $a_m$ and h are related by the following distributional partial differential
equation:

$$\sum_{m=0}^{\infty} (-1)^m a_m \nabla^{2m} h(x) = \delta(x).$$
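
The effect of the regularization parameter in the system $Y = (G + \lambda I)c$ can be seen in a small numerical sketch. The Gaussian width, the noise level, and the two values of `lam` below are assumptions chosen only for illustration.

```python
import numpy as np

def gaussian_gram(x, sigma=0.5):
    """Matrix G_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)); sigma is an assumed width."""
    diff = x[:, None, :] - x[None, :, :]
    return np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2))

def regularized_rbf_fit(x, y, lam, sigma=0.5):
    """Solve (G + lam * I) c = Y for the expansion coefficients."""
    G = gaussian_gram(x, sigma)
    return np.linalg.solve(G + lam * np.eye(len(x)), y)

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=(25, 1))
y = np.sin(3.0 * x[:, 0]) + 0.1 * rng.standard_normal(25)   # noisy samples (assumed)

c_small = regularized_rbf_fit(x, y, lam=1e-6)   # nearly interpolating
c_large = regularized_rbf_fit(x, y, lam=1e-1)   # strongly smoothed

norm_small = np.linalg.norm(c_small)
norm_large = np.linalg.norm(c_large)
```

Because G is symmetric and positive semidefinite, increasing `lam` shrinks every component of c in the eigenbasis of G, so `norm_large` is smaller than `norm_small`; the fitted surface follows the noisy data less closely as a result.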

The stabilizer described above is not the most general one. Other types could have been chosen,
depending on the a priori information about the surface to be reconstructed. The previous one is
suitable if we want to keep local the interaction between a data point and its neighbors, since the
Gaussian falls off very quickly, that is, the “interaction” is short range. It can be shown that this is
related to the presence of a term of degree zero in the stabilizer. For example, in two dimensions,
if we choose a stabilizer like

$$\int\!\!\int dx \, dy \left[ \left( \frac{\partial^2 F}{\partial x^2} \right)^2 + 2 \left( \frac{\partial^2 F}{\partial x \, \partial y} \right)^2 + \left( \frac{\partial^2 F}{\partial y^2} \right)^2 \right]$$

this leads to a Radial Basis Function of the type $h(\|x\|^2) = \|x\|^2 \log \|x\|$. This kind of interaction
is clearly long-range, as it should be, since the corresponding functional is the bending energy of a
thin plate of infinite extent (Duchon and Meinguet gave the name thin plate splines to the solution
of the interpolation problem obtained by minimizing this functional).

The same kind of results can be obtained in a third way, in the networks framework. Let us consider
the network and the problem of finding the “synaptic” weights. If we adopt a least-squares criterion
we recover the usual linear system $Y = Gc$, but often it is considered an advantage to keep the
connections from growing to infinity, and so the following functional is minimized:

$$E(c) = \sum_{i} \left( y_i - \sum_{j=1}^{N} c_j G(\|x_i - x_j\|^2) \right)^2 + \lambda \sum_{j} c_j^2$$

where the last term gives a high price to the configurations in which some coefficient $c_j$ is very high.
It is immediate to see that the minimization of this functional leads to the solution of the linear
system $Y = (G + \lambda I)c$. This shows the equivalence between some of the “new” neural networks
techniques and classical regularization.

5  Other  Work 

5.1 Labeling the physical origin of edges: computing qualitative surface attributes

Physical  Discontinuities 

We classify edges according to the following physical events: discontinuities in surface properties,
called mark or albedo edges (e.g., changes in the color of the surface); discontinuities in the orientation
of the surface patch, called orientation edges (e.g., an edge in a polyhedron); discontinuities in
the illumination, called shadow edges; occluding boundaries, which are discontinuities in the object
space (a different object); and specular discontinuities, which exist for non-Lambertian objects.


Gamble,  Geiger,  Poggio,  and  Weinshall  have  implemented  a  part  of  the  general  scheme  [18].  More 
specifically,  they  have  used  a  simple  linear  classifier  to  label  edges  at  pixels  where  there  exists  an 
intensity  discontinuity,  using  the  output  of  the  line  process  associated  with  each  low-level  vision 
module. They use the fact that the modules’ discontinuities are aligned, having been integrated
with  the  intensity  edges  before,  so  that  the  nonexistence  of  a  module  discontinuity  at  a  pixel  is 
meaningful.  The  linear  classifier  corresponds  to  a  linear  network  where  each  output  unit  is  a 
weighted  linear  combination  of  its  inputs  (for  a  similar  application  to  a  problem  of  color  vision,  see 
[36]).  The  input  to  the  network  is  a  pixel  where  there  exists  an  intensity  edge  and  that  feeds  a  set 
of qualitatively different input units. The output is a real-valued vector of support for each label.
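
The structure of such a linear classifier can be sketched as a single matrix-vector product. Everything specific in the example is an assumption: the list of input modules, the encoding of each module's line process as a 0/1 flag, and above all the weights, which are random placeholders standing in for values that would have to be learned or set by hand.

```python
import numpy as np

# Hypothetical input units: one flag per low-level module, set to 1 when that
# module's line process reports a discontinuity at the pixel.
MODULES = ["stereo", "motion", "color", "texture"]
# Edge labels taken from the classification of physical discontinuities above.
LABELS = ["mark", "orientation", "shadow", "occluding", "specular"]

rng = np.random.default_rng(5)
W = rng.standard_normal((len(LABELS), len(MODULES)))   # placeholder weights only

def label_support(discontinuity_flags, weights=W):
    """Return one real-valued support per label for a single intensity-edge pixel."""
    x = np.asarray(discontinuity_flags, dtype=float)
    return weights @ x

flags = [1, 0, 1, 0]                 # e.g. stereo and color discontinuities present
support = label_support(flags)
best_label = LABELS[int(np.argmax(support))]
```

With real weights, the label of largest support would be taken as the classification of the edge at that pixel.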


In  the  system  we  have  so  far  implemented,  we  achieve  a  rather  restricted  integration,  since  each 
module  is  integrated  only  with  the  intensity  module,  and  labeling  is  done  via  a  simple  linear  classifier 
only. It is still unclear how successful labeling can be, using only local information.

5.2 Saliency, grouping and segmentation

A  grouping  and  segmentation  module  working  on  the  output  of  the  edge  detection  module  is  an 
important  part  of  a  vision  system:  humans  can  deal  with  monocular,  still,  black  and  white  pictures 
devoid  of  stereo,  motion  and  color.  We  are  now  developing  techniques  to  find  salient  edges,  to 
group  them  and  thereby  segment  the  image.  These  algorithms  have  not  been  integrated  yet  in  the 
Vision  Machine  system. 

5.2.1  Saliency  Measure 

Edge  maps  produced  by  most  current  edge  detectors  are  cluttered  with  edge  responses  and  may  have 
edges  caused  by  noise.  This  creates  difficulties  for  higher  level  processing,  since  the  combinatorics  of 
these  algorithms  often  depends  on  the  number  of  edge  primitives  being  examined.  What  is  needed 
is a technique to focus attention on the “important” edges in a scene. We call such attention-focusing
techniques, which measure the “importance” of an edge, saliency measures. Shimon Ullman
has  proposed  two  different  kinds  of  saliency  measures:  local  saliency  and  structural  saliency.  An 
edge’s  local  saliency  is  entirely  determined  by  features  of  that  edge  alone.  For  example,  an  edge’s 
length,  its  average  gradient  magnitude,  or  the  color  of  a  bounding  region  serve  as  local  saliency 
measures.  Structural  saliency  refers  to  more  global  properties  of  an  edge  -  its  relationships  with 
other  edges.  Although  two  edges  may  not  be  locally  salient,  if  there  is  a  “nonaccidental”  relationship 
between  them,  then  the  structure  becomes  salient.  Examples  of  “nonaccidental”  relationships,  as 
pointed  out  by  David  Lowe,  include  collinearity,  parallelism,  and  symmetry,  among  other  things. 

We  have  investigated  local  saliency  measures  applied  to  the  output  of  the  Canny  edge  detector 
(Beymer,  in  preparation).  The  edge  features  we  have  considered  include  curvature,  edge  length, 
and  gradient  magnitude.  The  measure  favors  those  edges  that  have  low  average  curvature,  long 
length,  and  a  high  gradient  magnitude.  The  saliency  measure  eliminates  many  of  the  edges  due 
to  noise  and  many  of  the  unimportant  edges.  The  edges  that  remain  are  often  the  long,  smooth 
boundaries  of  objects  and  significant  intensity  changes  inside  the  objects.  We  expect  that  the 
salient  edges  will  help  higher  level  processes  such  as  grouping  (structural  saliency)  and  model 
based  recognition  by  allowing  them  to  focus  attention  on  regions  of  an  image  bounded  by  salient 
edges. 
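
A local saliency score of this kind can be sketched as follows. The report does not give the formula that was used, so the combination below (length times mean gradient magnitude, divided by one plus mean absolute curvature) is a hypothetical choice that merely respects the stated preferences; edges are assumed to be given as polylines of pixel coordinates with one gradient magnitude per point.

```python
import numpy as np

def edge_length(pts):
    """Total length of a polyline given as an (M, 2) array of points."""
    return float(np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1)))

def mean_abs_curvature(pts):
    """Average absolute turning angle per unit length along the polyline."""
    seg = np.diff(pts, axis=0)
    ang = np.arctan2(seg[:, 1], seg[:, 0])
    turn = np.diff(ang)
    turn = (turn + np.pi) % (2.0 * np.pi) - np.pi      # wrap to [-pi, pi)
    length = edge_length(pts)
    return float(np.sum(np.abs(turn)) / length) if length > 0 else 0.0

def local_saliency(pts, grad_mag):
    """Hypothetical score: long, straight, high-contrast edges score highest."""
    return edge_length(pts) * float(np.mean(grad_mag)) / (1.0 + mean_abs_curvature(pts))

# Two synthetic edges with the same contrast: a straight line and a zigzag.
straight = np.stack([np.arange(20.0), np.zeros(20)], axis=1)
zigzag = np.stack([np.arange(20.0), np.tile([0.0, 1.0], 10)], axis=1)
s_straight = local_saliency(straight, np.full(20, 5.0))
s_zigzag = local_saliency(zigzag, np.full(20, 5.0))
```

Thresholding such a score, or keeping only the top-ranked edges, would give the reduced edge map passed on to grouping and recognition.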

5.2.2  T  Junctions:  Their  Detection  and  Use  in  Grouping 

In cluttered imagery, that is, imagery containing many objects occluding one another, it is important to
group  together  pieces  of  the  image  that  come  from  the  same  object.  In  particular,  given  an  edge 
map  produced  by  the  Canny  edge  detector,  we  would  like  to  select  and  group  together  the  edges 
from  a  particular  object  before  running  high  level  recognition  algorithms  on  the  edge  data.  This 
grouping stage helps reduce the combinatorics of the higher level stages, as they are not forced to
consider  false  edge  groupings  as  objects.  Considering  how  occlusion  cues  can  be  used  in  grouping, 
we  have  investigated  the  detection  of  T  junctions  and  grouping  rules  arising  from  the  pairing  of  T 
junctions.  When  one  object  partially  occludes  another  in  a  cluttered  scene,  a  T  junction  is  formed 
between the two objects. David Beymer has developed algorithms for detecting T junctions as a
postprocessing  step  to  the  Canny  edge  detector.  The  Canny  edge  detector,  while  very  good  at 
detecting edges, is particularly bad at detecting junctions. Indeed, it was designed to detect one-dimensional
events. This one-dimensional characterization of the image breaks down at junctions
since  locally  there  are  three  or  more  surfaces  in  the  image.  We  have  investigated  how  one  could  use 
edge  curvature  and  region  properties  of  the  image  to  reconstruct  these  “broken”  junctions.  Often 
the  way  Canny  will  fail  at  junctions  is  that  one  of  the  three  curves  belonging  to  the  junction  will  be 
broken off from the other two. We have modified an existing algorithm and achieved promising
results  in  restoring  broken  T  junctions.  Once  located  in  the  image,  T  junctions  are  represented  by 
three  edges,  the  left  part  of  the  top  horizontal  edge  of  the  T,  the  right  part,  and  the  stem.  The 
top  horizontal  edges  are  the  occluding  edges  and  the  vertical  stem  is  the  occluded  edge.  Given 
the  junctions,  we  can  start  pairing  T  junctions  and  grouping  edge  fragments.  If  we  assume  that 
all  objects  in  the  scene  fit  entirely  within  the  image  boundaries,  all  T  junctions  must  be  matched 
up  with  a  “brother”  T  junction  along  the  occluded  edge  joining  them.  This  constraint  helps  to 
classify  T  junctions,  making  their  detection  more  robust.  Once  a  T  junction  is  matched  with  its 
brother,  we  know  exactly  which  edge  is  the  occluded  edge  (it  is  the  edge  that  is  traced  to  reach  the 
brother),  so  we  can  group  the  two  occluding  edges  together.  The  occluded  edge  will  be  extended, 
starting  a  search  process  to  bridge  the  occluding  object.  Here  we  are  looking  for  an  opposing  T 
junction on the other side of the occluding object. If such a pair of opposing Ts is found, we can
group  together  the  occluded  edges  of  the  respective  T  junctions.  The  application  of  these  grouping 
rules for occluding and occluded edges often produces closed contours when the Canny edges are
fairly  good.  For  each  closed  contour,  we  can  form  a  closed  region  corresponding  to  an  object  or 
object  part  in  the  image.  Finally,  the  T  junctions  are  used  to  calculate  relative  depth  information 
among  the  regions.  In  the  end,  the  system  can  divide  the  image  into  regions  corresponding  to 
objects  and  give  their  relative  depths.  The  algorithm  is  presently  working  on  “toy”  images  made 
from  construction  paper  cutouts  and  has  not  been  integrated  in  the  Vision  Machine  system. 


5.3  Fast  Vision:  The  Role  of  Time  Smoothness 

The present version of the Vision Machine processes only isolated frames. Even our motion algorithm
takes as input simply a sequence of two images. The reason for this is, of course, limitations
in  raw  speed.  We  cannot  perform  all  of  the  processing  we  do  at  video  rate  (say,  30  frames  per 
second),  though  this  goal  is  certainly  within  present  technological  capabilities.  If  we  could  process 
frames at video rate, we could exploit constraints in the time dimension similar to the ones we
are  already  exploiting  in  the  space  domain.  Surfaces,  and  even  the  brightness  array  itself,  do  not 
usually  change  too  much  from  frame  to  frame.  This  is  a  constraint  of  smoothness  in  time,  which  is 
valid  almost  everywhere,  but  not  across  discontinuities  in  time.  Thus  one  may  use  the  same  MRF 
technique,  applied  to  the  output  of  stereo,  motion,  color,  and  texture,  and  enforce  continuity  in 
time  (if  there  are  no  discontinuities),  that  is,  exploit  the  redundancy  in  the  sequence  of  frames. 

We  believe  that  the  surface  reconstructed  from  a  stereo  pair  usually  does  not  need  to  be  recomputed 
completely  when  the  next  stereo  pair  is  taken  a  fraction  of  a  second  later.  Of  course,  the  role  of 
the  MRFs  may  be  accomplished  in  this  case  by  some  more  specific  and  more  efficient  deterministic 
method such as, for example, a form of Kalman filtering. Notice that space-time MRFs applied to
the brightness arrays would yield spatiotemporal interpolation and approximation of a kind already
considered (Fahle and Poggio, 1980; Poggio, Nielsen, and Nishihara, 1982; Bliss, 1985).
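
To make the idea of temporal smoothness concrete, here is a minimal sketch of a Kalman-type update applied independently at every pixel of a depth map. It assumes a random-walk model for the surface, known noise variances, and a static synthetic scene; it ignores discontinuities in time, which a full treatment would have to detect and handle.

```python
import numpy as np

def kalman_update(est, var, meas, meas_var, process_var):
    """One predict/update step of a scalar Kalman filter, applied per pixel."""
    var_pred = var + process_var                 # predict under a random-walk model
    gain = var_pred / (var_pred + meas_var)      # weight given to the new frame
    est_new = est + gain * (meas - est)
    var_new = (1.0 - gain) * var_pred
    return est_new, var_new

rng = np.random.default_rng(4)
truth = np.full((8, 8), 2.0)                     # static surface (assumed)
est = np.zeros((8, 8))                           # initial estimate
var = np.full((8, 8), 1e3)                       # large initial uncertainty

for _ in range(30):                              # 30 frames, e.g. one second of video
    frame = truth + 0.2 * rng.standard_normal(truth.shape)
    est, var = kalman_update(est, var, frame, meas_var=0.04, process_var=1e-4)

err_filtered = np.mean(np.abs(est - truth))      # error of the running estimate
err_single = np.mean(np.abs(frame - truth))      # error of the last frame alone
```

Each new frame only corrects the running estimate instead of triggering a full recomputation, which is the saving the text anticipates.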

5.4  Parameter  Estimation  in  the  MRF  integration  stage 

Using  the  MRF  model  involves  an  energy  function  which  has  several  free  parameters,  in  addition  to 
the  many  possible  neighborhood  systems.  The  values  of  these  parameters  determine  a  distribution 
over  the  configuration-space  to  which  the  system  converges,  and  the  speed  of  convergence.  Thus 
rigorous  methods  for  estimating  these  parameters  are  essential  for  the  practical  success  of  the 
method  and  for  meaningful  results.  In  some  cases,  parameters  can  be  learned  from  the  data:  e.g., 
texture  parameters  (Geman  and  Graffigne,  1987),  or  neighborhood  parameters  (for  which  a  cellular 
automaton  model  may  be  the  most  convenient  for  the  purpose  of  learning).  There  are  general 
statistical  methods  which  can  be  used  for  parameter  estimation: 

•  A  maximum  likelihood  estimate  -  one  can  use  the  indirect  iterative  EM  algorithm 
(Dempster et al., 1977), which is most useful for maximum likelihood estimation from
incomplete data (see Marroquin, 1987 for a special case). This algorithm involves the
iterative  maximization  (over  the  parameter  space)  of  the  expected  value  of  the  likelihood 
function  given  that  the  parameters  take  the  values  of  their  estimation  in  the  previous 
iteration.  Alternatively,  a  search  constrained  by  some  statistics  for  a  minimum  of  an 
appropriate  merit  function  may  be  employed  (see  Marroquin,  1987). 

•  A  smoothing  (regularization)  parameter  can  be  estimated  using  the  methods  of 
cross-validation or unbiased risk, to minimize the mean square error. In cross-validation, an
estimate  is  obtained  omitting  one  data  point.  The  goal  is  to  minimize  the  distance  between 
the  predicted  data  point  (from  the  estimate  above  with  the  point  omitted)  and  the  actual 
value,  for  all  points. 
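
The cross-validation procedure in the last item can be sketched as a leave-one-out loop. As an assumption made for this example, the estimator being tuned is the regularized Gaussian expansion of Section 4.3, with a smoothing parameter `lam` chosen from a small hypothetical grid of candidates.

```python
import numpy as np

def gram(a, b, sigma=0.5):
    """Gaussian kernel matrix between the rows of a and b (sigma is assumed)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2))

def loo_error(x, y, lam):
    """Mean squared error of predicting each data point from all the others."""
    n = len(x)
    errors = []
    for k in range(n):
        keep = np.arange(n) != k                           # omit data point k
        c = np.linalg.solve(gram(x[keep], x[keep]) + lam * np.eye(n - 1), y[keep])
        prediction = gram(x[k:k + 1], x[keep]) @ c         # predict the omitted point
        errors.append((prediction[0] - y[k]) ** 2)
    return float(np.mean(errors))

rng = np.random.default_rng(6)
x = rng.uniform(-1.0, 1.0, size=(30, 1))
y = np.sin(3.0 * x[:, 0]) + 0.1 * rng.standard_normal(30)

candidates = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]                 # hypothetical grid
scores = [loo_error(x, y, lam) for lam in candidates]
best_lam = candidates[int(np.argmin(scores))]
```

The loop refits the model once per omitted point, which is affordable for small problems; for large ones, closed-form or approximate cross-validation scores are normally used instead.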

In the case of Markov Random Fields, some more specific approaches are appropriate for parameter
estimation: 

1) Besag (1972) suggested conditional maximum likelihood estimation using coding methods, maximum
likelihood estimation with unilateral approximations on the rectangular lattice, or “maximum
pseudolikelihood” - a method to estimate parameters for homogeneous random fields (see Geman
and Graffigne, 1987).

2)  For  the  MPM  estimator,  where  a  fixed  temperature  is  yet  another  parameter  to  be  estimated,  one 
can  try  to  use  the  physics  behind  the  model  to  find  a  temperature  with  as  little  disorder  as  possible 
and  still  reasonable  time  of  convergence  to  equilibrium  (e.g.,  away  from  “phase-transition”). 

An  alternative  asymptotic  approach  can  be  used  with  smoothing  (regularization)  terms:  instead  of 
estimating  the  smoothing  parameter,  let  it  tend  to  0  as  the  temperature  tends  to  0,  to  reduce  the 
smoothing  close  to  the  final  configuration  (see  Geman  and  Geman,  1987). 

In  summary,  we  plan  to  explore  three  distinct  stages  for  parameter  estimation  in  the  integration 
stage  of  the  Vision  Machine: 

•  Modeling  (from  the  physics  of  surfaces,  of  the  imaging  process  and  of  the  class  of  scenes  to 
be  analyzed  and  the  tasks  to  be  performed)  and  the  form  of  the  prior  and  of  some 
conditional  probabilities  involved  (e.g.,  the  type  of  physical  edges  from  properties  of  the 
measurements,  such  as  characteristics  of  the  brightness  data).  Range  of  allowed  parameter 
values  may  also  be  established  at  this  stage  (e.g.,  minimum  and  maximum  brightness  value 
in  a  scene,  depth  differences,  positivity  of  certain  measurements,  distribution  of  expected 
velocities,  reflectance  properties,  characteristics  of  the  illuminant,  etc.). 

• Estimation of parameter values from a set of examples in which data and desired solution are
given.  This  is  a  learning  stage.  We  may  have  to  use  days  of  CM  time  and,  at  least  initially, 
synthetic  images  to  do  this. 

•  Tuning  of  some  of  the  parameters  directly  from  the  data  (by  using  EM  algorithm, 
cross-validation,  Besag’s  work,  or  various  types  of  heuristics). 

The dream is that at some point in the future the Vision Machine will run all the time, day and
night,  looking  about  and  learning  on  its  own  to  see  better  and  better. 

5.5  Object  Recognition 

In  earlier  reports,  we  have  described  a  series  of  approaches  to  the  problem  of  model-based  object 
recognition,  based  on  matching  object  shape.  Our  work  has  proceeded  along  a  number  of  fronts. 

5.5.1 Recognition from Matched Dimensionalities


Earlier reports described the work of Grimson and Lozano-Perez on the recognition of occluded
objects  from  noisy  sensory  data  under  the  condition  of  matched  dimensionality  [29].  Specifically,  if 
the  objects  to  be  recognized  and  localized  are  laminar  and  lie  on  a  flat  surface,  or  if  the  objects  are 
volumetric  but  lie  in  stable  configurations  on  a  flat  surface,  then  the  sensory  data  need  only  be  two- 
dimensional  (e.g.  a  single  image);  if  the  objects  to  be  recognized  and  localized  are  volumetric  and 
lie  in  arbitrary  positions,  then  the  sensory  data  must  be  three  dimensional  (e.g.  stereo  or  motion 
data,  laser  range  data).  The  original  technique  (called  RAF)  was  designed  to  recognize  polyhedral 
objects  from  simple  measurements  of  the  position  and  surface  orientation  of  small  patches  of  surface. 
The technique searches for consistent matchings between the faces of the object models and the
sensory  measurements,  using  constraints  on  the  relative  shape  of  pairs  of  model  faces  and  pairs  of 
measurements  to  reduce  the  search. 

Our  empirical  work  on  RAF  has  advanced  along  a  number  of  dimensions.  First,  we  have  shown  that 
the  RAF  framework  can  successfully  recognize  and  locate  objects  based  on  a  variety  of  geometric 
features:  edges,  vertices,  curved  arcs,  planar  surface  patches,  and  axes  of  cylinders  and  cones. 
Second,  we  have  also  shown  that  such  features  can  be  extracted  from  a  range  of  sensory  information, 
including  grey  level  images,  stereo  data,  motion  data,  sonar  returns,  laser  striping  data  and  tactile 
data.  Third,  we  have  shown  that  the  RAF  framework  can  be  extended  to  deal  with  some  classes 
of  parameterized  objects.  These  include  the  recognition  of  objects  that  can  scale  in  size,  the 
recognition  of  objects  that  are  composed  of  rigid  subparts  connected  through  rotational  degrees 
of  freedom  (e.g.  a  pair  of  scissors)  and  the  recognition  of  objects  that  can  undergo  a  stretching 
deformation  along  one  axis. 

Our  empirical  experience  with  RAF  suggested  that  the  method  was  remarkably  efficient  when 
dealing  with  data  from  a  single  object,  but  was  inefficient  when  spurious  data  was  included.  To 
overcome  this,  we  have  incorporated  a  Hough  transform  to  preselect  portions  of  the  search  space 
on  which  to  focus  attention,  and  we  have  used  thresholds  on  the  goodness  of  an  interpretation  to 
terminate  search.  The  combination  of  these  two  techniques  resulted  in  dramatic  improvement  in 
the  efficiency  of  the  search  method.  Based  on  these  observations,  we  have  been  developing  a  formal 
basis  for  explaining  these  results.  In  particular,  we  have  shown  the  following  formal  results: 

•  If  all  of  the  data  is  known  to  have  come  from  a  single  object,  the  expected  amount  of  search 
is  quadratic  in  the  number  of  data  and  model  features. 

•  If  spurious  data  is  included,  the  expected  amount  of  search  is  polynomial  in  the  number 
of  data  and  model  features,  but  exponential  in  the  size  of  the  actual  correct 
interpretation. 

•  Using  a  Hough  transform  to  preselect  subspaces  of  the  search  space  reduces  the  values  of 
the  parameters  in  the  complexity  bounds,  but  still  leaves  an  exponential  problem. 

•  Using  premature  termination  of  search  based  on  a  threshold  on  a  "good"  interpretation 
reduces  the  expected  search.  In  particular,  if  the  scene  clutter  is  small  enough  relative  to 
the  noise  in  the  data,  the  expected  search  becomes  polynomial;  otherwise  it  is  a  low  order 
exponential. 
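
The effect of spurious data and of premature termination can be sketched in a toy setting. The example assumes a deliberately simplified one-dimensional problem in which every feature is an edge orientation in degrees and the pairwise constraint is the difference between orientations; a null branch (`None`) lets a datum be treated as spurious. All names, the tolerance, and the data are hypothetical.

```python
def ang_diff(a, b):
    """Signed difference a - b in degrees, wrapped to (-180, 180]."""
    d = (a - b) % 360.0
    return d - 360.0 if d > 180.0 else d

def search(model, data, tol=5.0, threshold=None):
    """Depth-first matching with a null branch; returns (best, nodes_visited).

    best holds a model index, or None for a spurious datum, per data item.
    If threshold is given, the search stops at the first complete
    interpretation with at least that many non-null matches.
    """
    state = {"best": (), "best_score": -1, "nodes": 0, "done": False}

    def ok(assign, i, f):
        # Compare the new pairing with every earlier non-null pairing.
        for j, g in enumerate(assign):
            if g is None:
                continue
            model_rel = ang_diff(model[f], model[g])
            data_rel = ang_diff(data[i], data[j])
            if abs(ang_diff(model_rel, data_rel)) > tol:
                return False
        return True

    def recurse(assign, score):
        if state["done"]:
            return
        state["nodes"] += 1
        i = len(assign)
        if i == len(data):
            if score > state["best_score"]:
                state["best"], state["best_score"] = assign, score
            if threshold is not None and score >= threshold:
                state["done"] = True  # premature termination
            return
        for f in range(len(model)):
            if ok(assign, i, f):
                recurse(assign + (f,), score + 1)
        recurse(assign + (None,), score)  # treat datum i as spurious

    recurse((), 0)
    return state["best"], state["nodes"]

# Hypothetical model, observed rotated by 25 degrees with two spurious edges.
model = [0.0, 40.0, 100.0, 190.0]
data = [25.0, 77.0, 65.0, 160.0, 125.0, 215.0]
best_full, nodes_full = search(model, data)
best_cut, nodes_cut = search(model, data, threshold=4)
```

Both calls recover the same interpretation, but the thresholded search stops as soon as four consistent matches are found instead of exploring every combination of null assignments. This is only an illustration of the mechanism, not of the formal bounds stated above.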

To  support  the  use  of  Hough  transforms  and  premature  termination  of  search,  Eric  Grimson  and 
Daniel  Huttenlocher  have  executed  a  formal  analysis  of  these  methods  [28].  They  have  derived 
formal  characterizations  for  the  probability  of  false  positives  in  the  Hough  space,  as  a  function  of 
the  noise  in  the  data  and  the  characteristics  of  the  Hough  transform.  These  results  provide  a  means 
of  evaluating  the  efficacy  of  the  Hough  transform,  and  suggest  that  one  should  not,  in  general,  rely 
on  the  Hough  transform  to  fully  solve  the  recognition  problem,  but  rather  that  one  should  use 
it  as  a  preprocessor,  selecting  out  small  subspaces  within  which  the  RAF  method  can  be  applied 
effectively.  The  results  support  the  empirical  observations  concerning  the  reduction  in  search. 
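
The preprocessing role of the Hough transform can be sketched as follows. The example assumes the same kind of simplified one-dimensional pose space (a single rotation, with features given as edge orientations in degrees); the bin width and all names are hypothetical, and real pose spaces have more dimensions.

```python
from collections import defaultdict

def hough_preselect(model, data, bin_width=10.0):
    """Vote over rotation bins; return (peak bin center, pairings in the peak)."""
    votes = defaultdict(list)
    for i, d in enumerate(data):
        for f, m in enumerate(model):
            # Each (datum, model feature) pairing implies one candidate rotation.
            rotation = (d - m) % 360.0
            votes[int(rotation // bin_width)].append((i, f))
    peak = max(votes, key=lambda b: len(votes[b]))
    return (peak + 0.5) * bin_width, sorted(votes[peak])

# Hypothetical model, observed rotated by 25 degrees with two spurious edges.
model = [0.0, 40.0, 100.0, 190.0]
data = [25.0, 77.0, 65.0, 160.0, 125.0, 215.0]
rotation, pairs = hough_preselect(model, data)
```

Only the pairings that voted for the peak bin would then be handed to the constrained search, which is the sense in which the transform selects a small subspace rather than solving the problem itself. With noisy data the correct votes can split across neighboring bins and random pairings can pile up in a wrong bin, which is why the peak is treated here as a hypothesis to verify and not as a final answer.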

Grimson  and  Huttenlocher  have  also  developed  a  formal  characterization  of  thresholds  for 
terminating  search,  relating  analytic  bounds  on  such  thresholds  to  expected  probabilities  of  errors.  These 
formal  results  have  been  shown  to  agree  with  empirical  evidence  from  several  recognition  systems. 

Much  of  our  earlier  work  with  the  RAF  recognition  system  dealt  with  robotics  environments  and 
the  recognition  of  industrial  parts.  We  have  continued  this  effort  by  integrating  RAF  into  the 
HANDEY  task-level  planning  system  of  Lozano-Perez.  We  have  also  continued  a  pilot  study  of 
applying  the  technique  to  a  very  different  domain,  underwater  localization.  Specifically,  we  have 
considered  the  problem  of  determining  the  location  of  an  autonomous  underwater  vehicle  by 
matching  sensory  data  obtained  by  the  vehicle  against  bathymetric  or  other  maps  of  the  environment. 
Sensor  modalities  include  active  methods  such  as  sonar,  and  passive  methods  such  as  pressure 
readings  and  Doppler  data  from  passing  ships.  We  have  conducted  some  early  simulation  experiments 
using  RAF,  together  with  strategies  for  acquiring  sensory  data  to  solve  this  localization  problem, 
with  excellent  results. 

Our  formal  analysis  and  our  empirical  experience  both  argue  that  the  RAF  approach  to  recognition 
fails  to  adequately  deal  with  the  issue  of  segmentation  of  the  data  into  subsets  that  are  likely  to 
have  come  from  a  single  object.  While  the  Hough  transform  can  help  reduce  this  problem,  it 
is  model  driven,  and  hence  potentially  very  expensive  when  applied  to  large  libraries  of  objects. 
As  an  alternative  to  this,  David  Jacobs  has  directly  addressed  the  issue  of  generic  grouping  in 
an  image  [38].  Jacobs  has  derived  measures  for  determining  the  probability  that  a  set  of  edge 
fragments  in  an  image  is  likely  to  have  come  from  a  single  object.  These  measures  consider  simple 
measurements  such  as  the  separation  of  groups  of  edges,  and  the  relative  alignment  of  groups  of 
edges.  The  recognition  system,  since  it  does  not  directly  consider  the  object  model,  may  occasionally 
be  incorrect.  However,  tests  of  the  system  on  a  variety  of  images  of  two-dimensional  and 
three-dimensional  scenes  show  a  dramatic  reduction  in  the  search  required  to  recognize 
objects  from  a  library,  and  show  that  it  is  quite  effective  at  identifying  groups  of  edges  coming  from  a  single 
object.  The  effect  of  this  grouping  mechanism  is  particularly  apparent  when  applied  to  libraries  of 
objects,  since  the  parameters  computed  by  the  grouping  scheme  can  be  used  to  do  effective  indexing 
into  a  library. 

We  have  also  continued  to  investigate  the  use  of  parallel  architectures,  such  as  the  Connection 
Machine,  to  obtain  significant  performance  improvements.  Todd  Cass  has  completed  the  develop¬ 
ment  and  implementation  of  a  parallel  recognition  scheme  for  two-dimensional  scenes,  on  which  he 
reported  in  the  1988  Proceedings.  The  system  uses  a  careful  Hough  transform  method,  followed 
by  a  sampling  scheme  in  the  parameter  space  to  find  instances  of  an  object  and  its  pose.  Typical 
performance  of  the  method  involves  the  correct  identification  and  localization  of  heavily  occluded 
objects,  in  scenes  in  which  a  large  number  of  other  parts  are  present,  in  under  five  seconds,  using 
a  16K  processor  configuration  of  the  Connection  Machine.  More  recent  work  -  mentioned  earlier  - 
has  focused  on  integrating  this  recognition  method  with  data  provided  by  the  Vision  Machine. 